ABSTRACT


The  development  of  a  digital  encoding  system  for  speech  and  audio  signals 
is  described.  The  system  is  designed  to  exploit  the  limited  detection  ability 
of  the  auditory  system.  Existing  digital  encoders  are  examined.  Relevant 
psychoacoustic  experiments  are  reviewed.  Where  the  literature  is  lacking, 
a  simple  masking  experiment  is  performed  and  the  results  reported.  The 
design  of  the  encoding  system  and  specifications  of  system  parameters  are 
then developed from the perceptual requirements and digital signal processing
techniques.

The encoder is a multi-channel system, each channel having approximately a
critical bandwidth. The input signal is filtered via the quadrature mirror filter
technique. An extensive development of this technique is presented. Channels
are quantized with an adaptive PCM scheme.

The encoder is evaluated for speech and audio signal inputs. For 4.1-kHz
bandwidth speech, the differential threshold of encoding degradation occurs
at a bit rate of 34.4 kbps. At 16 kbps, the encoder produces toll-quality
speech output. Audio signals of 15-kHz bandwidth can be encoded
at 123.8 kbps without audible degradation.


This report is based on a thesis submitted to the Department of Electrical
Engineering and Computer Science at the Massachusetts Institute of Technology
on 4 May 1979 in partial fulfillment of the requirements for the degree
of Doctor of Philosophy.
DIGITAL ENCODING OF SPEECH AND AUDIO SIGNALS
BASED ON THE PERCEPTUAL REQUIREMENTS OF THE AUDITORY SYSTEM

I. INTRODUCTION

Digital  techniques  for  the  processing,  transmission,  and  storage  of  speech  and  other  audio 
signals  have  become  increasingly  important  in  the  past  several  years.  Advantages  of  the  digital 
domain  include  flexible  processing  not  previously  possible,  increased  transmission  reliability, 
and  error-resistant  storage.  Implicit  in  the  conversion  of  a  continuous  audio  waveform  into  a 
digital  bit  stream  are  degradations  due  to  the  nonlinearity  of  the  process.  While  a  sentence  can 
be  transmitted  simply  by  coding  exactly  the  letters  of  each  word,  there  are  an  infinite  number 
of waveforms that could represent that spoken sentence. Increasing the amount of information
in the digital signal by increasing the digital bit rate can decrease the error in the digital
approximation of the sentence waveform, but it increases the costs of transmission and storage by
necessitating the use of a channel of higher capacity.

Audio signals differ from other signals since the intent is to communicate with a person.
The performance of an audio system cannot be measured by a simple root-mean-square (RMS)
error measurement. Rather, it is the complex processing of the auditory system that determines
its quality. Indeed, an audio system that compares favorably to another system in a traditional
signal-to-noise ratio (SNR) error measurement, the ratio of the mean square signal
level to the mean square noise, may be judged as annoying to listen to and, therefore, of lower
quality. An audio system with additive white noise with an SNR of 20 dB is generally preferred
to a system with 10 percent harmonic distortion. For speech systems, even listener preference
does not correlate well with intelligibility. For example, adding dither to a 3-bit-per-sample
linear pulse code modulation (PCM) system does not affect the SNR of the system. The dithered
system, however, is of lower intelligibility, but is preferred by listeners over the undithered
PCM coder [1]. To design and evaluate an audio system, it is necessary to have some understanding
of the functioning of the auditory system, its capabilities and limitations. With that understanding,
it may be possible to identify subjective quality variables and relate them to objective
physical quantities in the stimuli.

In  this  report,  an  encoding  system  is  designed  to  exploit  the  limitation  of  the  auditory  system 
imposed  by  masking  characteristics,  the  ability  of  one  sound  to  inhibit  the  perception  of  another 
sound.  The  system  is  designed  so  that  the  error  noise  due  to  quantization  is  masked  by  the  audio 
signal  being  encoded.  The  test  of  the  system  is  whether  a  listener  can  perceive  any  differences 
between  the  original  signal  and  the  signal  that  has  been  processed  by  the  encoding  system. 

A.  Historical  Development  of  the  Problem 

Digital  encoding  of  audio  signals  can  be  grouped  into  three  types  of  systems  by  their  quality 
and  applications.  The  highest  quality  coders  have  been  developed  for  use  with  voice  and  music 
for  the  radio  broadcast  and  record  industries.  These  systems  are  characterized  by  wide  signal 
bandwidths  of  12  to  20  kHz  and  large  SNRs  of  50  to  100  dB.  Bit  rates  of  up  to  500  kilobits  per 
second (kbps) are common in these high-quality systems. The fidelity criterion for these systems
is that the listener will perceive little or no degradation of the input signal.

Digital encoders for speech tend to emphasize lower bit rates since the objective is often
a low-cost communication system, as for telephone communications. Degradations are permitted
as long as intelligibility is high and it is not annoying to listen to the sound for reasonable lengths
of time. For commercial telephone applications, a 64-kbps system is commonly used. Although
coders with comparable quality such as adaptive differential pulse code modulation (ADPCM) [2]
have been developed using rates lower than 40 kbps (Chapter II-D-1), the implementation costs
of these algorithms often outweigh the savings due to the use of lower capacity channels.

The third area of development has been in very low-rate speech communication systems
needing channel capacities as low as 2.4 kbps. Invariably, this bit-rate reduction is achieved by
modeling the production of the input speech by a slowly time-varying vocal-tract system. Information
to specify the parameters of that production system is encoded. Although intelligibility
is high for speech input with a low noise background, signals that do not fit the model, such
as nonspeech and speech in a noisy environment, are reproduced poorly.

For  many  applications,  it  is  not  yet  economically  feasible  to  use  digital  transmission  and 
storage methods because of the cost of the high-capacity channels required. Encoding systems
that  would  permit  the  use  of  lower  capacity  channels  while  maintaining  the  necessary  signal 
quality  would  open  new  applications  areas  for  digital  techniques. 

B.  Scope  of  this  Report 

The objective of this report is to relate the results of psychoacoustic research to the development
of a digital encoding system. By using the limitations of the auditory system, the system
is  made  to  be  efficient,  using  only  the  bit  rate  necessary  to  maintain  its  quality. 

The  encoder  is  designed  so  that  the  degradations  introduced  through  its  processing  are  not 
audible  when  presented  along  with  the  audio  signal.  The  encoder,  based  on  the  characteristics 
of  the  auditory  system,  should  work  well  with  speech,  music,  or  any  other  audio  signal. 

The report is divided into several parts. Existing digital encoders are examined and relevant
psychoacoustic experiments are reviewed. Where the literature is lacking, simple experiments
are performed, and the results of these experiments are reported. The design of the system
is then developed from the perceptual requirements and digital signal-processing techniques.
The system is evaluated with high-quality speech and audio signals to determine parameters for
broadcast-quality transmission and archival-quality storage. Experiments are performed to find
the minimum bit rate such that the processing of the system is not noticeable to the average listener.
The system parameters are then set for lower bit rates and the encoder compared to
other encoders for possible use in basic speech communication systems.
II. REVIEW OF EXISTING DIGITAL ENCODERS


A.  Introduction 

Degradation of analog signal quality resulting from processing, storage, and transmission
of speech and audio signals is often a major obstacle in the implementation of such systems.
To alleviate this problem, digital techniques are being used increasingly for high-quality speech
and audio systems. For voice communication over phone lines and satellite links, digital techniques
simplify the multiplexing of several conversations and the protection from noise. As the
technology has progressed, many algorithms for digital encoding have emerged.

Digitization  requires  two  processes,  sampling  of  the  signal  at  discrete  instants  of  time  and 
quantization  of  the  signal  samples  to  a  discrete  number  of  bits  of  information.  (This  is  not 
strictly  true  for  some  low-rate  vocoders  that  try  to  model  the  speech  production  process.  It  is 
a  valid  assumption  for  the  class  of  encoders  and  quantizers  relevant  to  this  research.)  It  is 
sometimes  convenient  to  consider  sampling  and  quantization  to  be  separate  processes  even 
though  they  may  be  implemented  together.  The  ordering  of  these  processes  is  not  important  and 
is chosen to simplify the analysis in this chapter.

For  a  band-limited  signal,  the  process  of  sampling  can  be  accomplished  without  any  loss 
of  information.  By  sampling  at  the  Nyquist  rate,  a  rate  of  twice  the  highest  frequency  present 
in the continuous-time signal, the sampling is a simply reversible process. Quantization, however,
introduces error. It is the audibility of this error that encoding systems try to minimize.

For speech signals, voiced segments have very little energy above 4 kHz. Unvoiced speech
sounds, however, have significant energy at frequencies greater than 8 kHz. Sampling at
approximately 8 kHz for a resultant 4-kHz signal bandwidth is typical for telephone and similar
communications. Very little loss of intelligibility is evidenced at that sampling rate. Larger
bandwidths are used for higher-quality speech systems and for music. A frequency response
to 15 kHz and higher is typical of broadcast-quality audio encoders. Music can usually be filtered
to 15 kHz with little or no audible degradation.

The most basic encoding system is pulse code modulation (PCM). Among its advantages
are simplicity, direct representation as binary numbers for digital storage, and easy implementation
of the corresponding analog-to-digital (A/D) and digital-to-analog (D/A) converters. Most
other  systems  are  derived  from  PCM. 

B.  Instantaneous  Quantization 

Instantaneous quantizers are characterized by memoryless input-output relations. The relation
is nonlinear by necessity as the output is only permitted to take on a finite number of
values.

In  a  continuous-time  system,  the  output  of  a  memoryless  nonlinear  process  is  periodic  if 
the  input  is  periodic.  The  output  contains  energy  only  at  multiples  of  the  fundamental  frequency, 
the  frequency  of  the  periodicity.  The  error  is  harmonic  distortion  for  sinusoidal  inputs,  and 
harmonic and intermodulation distortion for inputs that are sums of sinusoids. If a band-limited
signal is quantized, the distortion products are not restricted to that band. When the resulting
signal is sampled at a rate commensurate with the bandwidth of the unquantized signal, the components
of the error signal outside of the original frequency band are aliased into that band at
frequencies  that  are  not  necessarily  related  to  the  original  signal.  This  error  may  then  sound 
like  white  noise  or  harmonic  distortion,  depending  on  the  exact  quantization  system  used. 
[Figure: horizontal axis, frequency / input signal bandwidth]


1.  Pulse  Code  Modulation 

PCM quantization divides the input amplitude range into a number of equally spaced intervals.
The number of intervals is usually a power of 2 so that the interval into which a sample
falls  may  be  coded  into  an  integer  number  of  binary  bits.  A  sample  is  approximated  by  the 
value  in  the  middle  of  the  interval  into  which  it  falls.  This  input-output  relation  is  shown  in 
Fig.II-1. 

As the number of levels increases, the bandwidth of the quantized signal increases. Figure II-2
shows the bandwidth of distortion for the quantization of a unit-bandwidth, unit-power
random noise. Note that for 8-bit quantization (256 levels) the spectrum of the error has decreased
only 10 dB in power from the low-frequency level by a frequency of 300 times the input
bandwidth. Using this graph, the perception of the quantization error for PCM can be explained.

One-bit  PCM,  2  levels,  is  a  hard-limiter  transforming  an  input  signal  into  a  pulse  train 
with  time-varying  duty  cycle.  Since  the  distortion  products  present  in  the  hard-limited  signal 
are  predominantly  within  the  original  signal  bandwidth,  much  of  the  error  energy  will  not  be 
aliased  by  sampling.  The  error  is  perceived  as  distortion  of  the  input. 

Ten-bit  PCM,  1024  levels,  represents  a  good  approximation  to  the  input.  The  error  is 
very  wideband  as  implied  by  Fig.II-2.  When  sampled,  there  is  little  correlation  of  the  error 
with  the  input  and  it  sounds  like  white  noise. 

As the number of bits increases, the level of the error decreases; expressed in decibels, the
decrease is linear at about 6 dB per added bit. For an N-bit quantizer, the SNR for the
maximum-level sinusoid input that will not overload the quantizer's range is given in Eq.(II-1):

SNR = 6.02 N + 1.76 dB .   (II-1)

If a signal such as speech is to be quantized, headroom must be left to prevent clipping on
peaks larger than the average level. Assuming a Laplacian density for the amplitude of the
speech samples, only 0.35 percent of the samples will be larger in magnitude than four standard
deviations. Setting four standard deviations equal to the maximum quantizer level, i.e., allowing
for peak samples 12 dB greater than the RMS signal amplitude, the SNR is now 9 dB less than in
Eq.(II-1), as in Eq.(II-2) [5]:

SNR = 6.02 N - 7.27 dB .   (II-2)
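The two SNR expressions above can be checked numerically. The following is a minimal sketch, assuming a mid-rise uniform quantizer over [-x_max, x_max]; the function names and the test-tone frequency are illustrative choices, not taken from the report.

```python
import math


def pcm_quantize(x, n_bits, x_max=1.0):
    """Uniform mid-rise quantizer: return the midpoint of the interval holding x."""
    levels = 2 ** n_bits
    step = 2.0 * x_max / levels
    # Interval index, clipped so that out-of-range inputs saturate.
    index = math.floor((x + x_max) / step)
    index = min(max(index, 0), levels - 1)
    return -x_max + (index + 0.5) * step


def sine_snr_db(n_bits, n_samples=20000):
    """Measured SNR in dB for a full-scale sinusoid (illustrative test tone)."""
    signal_power = 0.0
    error_power = 0.0
    for k in range(n_samples):
        # An irrational frequency ratio avoids a short repeating sample pattern.
        x = math.sin(2.0 * math.pi * k * math.sqrt(2.0) / 50.0)
        e = pcm_quantize(x, n_bits) - x
        signal_power += x * x
        error_power += e * e
    return 10.0 * math.log10(signal_power / error_power)


def laplacian_tail(k_sigma):
    """P(|x| > k_sigma standard deviations) for a zero-mean Laplacian density."""
    return math.exp(-math.sqrt(2.0) * k_sigma)


def speech_snr_db(n_bits, k_sigma=4.0):
    """Predicted SNR when the quantizer peak is set to k_sigma standard deviations.

    Noise power is step**2 / 12 with step = 2 * k_sigma * sigma / 2**n_bits.
    """
    return 6.02 * n_bits + 10.0 * math.log10(12.0 / (2.0 * k_sigma) ** 2)
```

Under these assumptions, `sine_snr_db(8)` should land close to the value of Eq.(II-1) for N = 8, `laplacian_tail(4.0)` gives roughly 0.35 percent, and `speech_snr_db(N)` reproduces the constant of Eq.(II-2).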

2.  Dither 

The error for a PCM quantization system is normally assumed to be statistically independent
of the input waveform. This independence is caused by the aliasing of high-frequency components
of the quantized signal into the signal band. For a small number of levels, this assumption
is not valid. As discussed in Section II-B-1, the quantized and sampled signal will sound
like the result of harmonic distortion when just a couple of bits are used. To remove the correlation
of the error to the input, the scheme shown in Fig.II-3 can be used. White noise with
a  uniform  probability  density  with  width  equal  to  one  quantization  interval  is  added  to  the  signal 
before  quantization.  The  same  noise  is  subtracted  after  the  quantizer.  The  dither  noise  has  the 
effect  of  making  the  error  be  a  white-noise  process.  In  practice,  the  quantized  signal,  y'(n), 
is  transmitted  or  stored  and  it  is  not  practical  to  provide  the  dither  noise  signal,  z(n),  to  the 
decoder.  The  dither  noise  is  implemented  as  a  pseudorandom  sequence  that  can  be  generated 
by  both  the  encoder  and  decoder  without  information  exchange  except  for  synchronization. 
Fig.II-3. PCM quantizer with dither.


The  error  for  the  quantizer  with  dither  can  be  shown  to  be  zero  mean  white  noise  that  is 
statistically independent of the input. The noise power is the same as it is for the system without
dither.
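The dither scheme described above can be sketched as follows. This is a minimal illustration, assuming a mid-tread uniform quantizer and Python's seeded `random.Random` as the shared pseudorandom source; the function names and the seed-based synchronization are hypothetical, not the report's implementation.

```python
import random


def quantize(x, step):
    """Uniform mid-tread quantizer with the given step size."""
    return step * round(x / step)


def dither_sequence(seed, n, step):
    """Pseudorandom dither z(n), uniform over one quantization interval."""
    rng = random.Random(seed)
    return [rng.uniform(-step / 2.0, step / 2.0) for _ in range(n)]


def dither_encode(samples, step, seed):
    """Add dither before quantization; the result is what would be transmitted."""
    z = dither_sequence(seed, len(samples), step)
    return [quantize(x + zn, step) for x, zn in zip(samples, z)]


def dither_decode(codes, step, seed):
    """Regenerate the same dither from the shared seed and subtract it."""
    z = dither_sequence(seed, len(codes), step)
    return [y - zn for y, zn in zip(codes, z)]
```

Because the decoder rebuilds z(n) from the seed alone, only the quantized values need to be exchanged. The overall error stays within half a step, and its power is close to step squared over 12, the same as for the quantizer without dither.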

Since  harmonic  distortion  is  not  pleasant  to  listen  to,  the  system  with  dither  is  rated  as 
having  a  higher  subjective  quality  than  the  PCM  system  without  dither,  even  though  the  RMS 
error  has  not  changed.  It  is  interesting  to  note  that  adding  dither  to  2-  and  3-bit  PCM  systems 
decreases the intelligibility while increasing subjective quality and listener preference [1].

3.  Instantaneous  Companding 

Most  audio  signals  of  interest  vary  in  short-time  average  power  over  time.  A  PCM  encoder, 
having  equally  spaced  quantization  intervals,  will  have  an  error  that  does  not  vary  in  amplitude 
for  different  input  amplitudes.  While  the  error  may  be  tolerable  for  loud  musical  passages  and 
speakers,  it  will  be  more  audible  for  lower-volume  time  intervals.  The  signal  power  in  speech 
may vary as much as 40 dB among speakers and environments. For example, a 7-bit PCM system
set for a loud speech segment as in Section II-B-1 for full use of the quantizer's range would
have an SNR of 35 dB. Another speech waveform might only use a few of the quantization levels
and have an SNR of 10 dB or less. In general, an extra four bits are necessary in PCM to compensate
for the wide variance of speech signal power. Music, where 60-dB differences of power
in  different  passages  are  common,  would  require  significantly  more  bits. 

The problem of encoding signals with large dynamic range (the ratio of the largest and smallest
short-time energy levels) can be reduced by companding. Companding is achieved by compression
of the signal before quantization to reduce the dynamic range and subsequent expansion
after quantization to undo the effects of compression. Companding is often combined with quantization
by using a nonuniform distribution of the quantization levels in a PCM system. Thus it
is  often  referred  to  as  nonlinear  PCM. 

To  maintain  the  SNR  over  any  region  of  the  input  dynamic  range,  it  is  necessary  to  quantize 
the logarithm of the signal magnitude. Unfortunately, this requires an infinite number of quantization
levels as the slope of the logarithm input-output relation is infinite for inputs near zero.
Fig.II-4. Distribution of quantization levels for μ-law with 3 bits.


A compromise is the use of nonlinear relations such as the μ-law [5,6], popular in the United
States, and the A-law [7,8,9], popular in Europe. As both are very similar, only the μ-law is
described. The compression function is shown in Eq.(II-3) and its effect on the distribution of
quantization levels for 3-bit PCM is shown in the graph of Fig.II-4:

$$F[x(n)] = x_{\max} \, \frac{\log\left[1 + \mu \, |x(n)| / x_{\max}\right]}{\log[1 + \mu]} \, \operatorname{sgn}[x(n)] \qquad \text{(II-3)}$$

where

x_max = Maximum magnitude input signal permitted
μ = A system parameter

As was desired, the quantization levels are distributed in a logarithmic fashion. As the variable
μ is increased from zero, the no-compression setting, the amount of compression increases
at the expense of a loss in the SNR for large amplitude inputs. A comparison of SNR
for PCM and μ-law nonlinear PCM is shown in Fig.II-5 as a function of the energy in the input
signal. Note that for 7 bits, the companded system maintains an SNR of 30 dB or greater until
the input level is 40 dB below clipping. For use in telephone quality systems where 30-dB SNR
is the standard, the 7-bit μ-law compander is comparable with an 11-bit PCM system.
Fig.II-5. SNR for μ-law and uniform quantization as a function
of input signal level and number of quantization bits.
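The μ-law compression function of Eq.(II-3) and its use with a uniform quantizer can be sketched as below. This is a minimal illustration; the default μ = 255 is an assumed value (the report only calls μ a system parameter), and the function names are hypothetical.

```python
import math


def mu_law_compress(x, mu=255.0, x_max=1.0):
    """Compression function of Eq.(II-3); mu=255 is an assumed default."""
    magnitude = x_max * math.log1p(mu * abs(x) / x_max) / math.log1p(mu)
    return math.copysign(magnitude, x)


def mu_law_expand(y, mu=255.0, x_max=1.0):
    """Inverse of mu_law_compress, applied after the quantizer."""
    magnitude = (x_max / mu) * math.expm1(abs(y) / x_max * math.log1p(mu))
    return math.copysign(magnitude, y)


def mu_law_quantize(x, n_bits, mu=255.0, x_max=1.0):
    """Compress, quantize uniformly (mid-rise), then expand."""
    levels = 2 ** n_bits
    step = 2.0 * x_max / levels
    y = mu_law_compress(x, mu, x_max)
    index = min(max(math.floor((y + x_max) / step), 0), levels - 1)
    y_hat = -x_max + (index + 0.5) * step
    return mu_law_expand(y_hat, mu, x_max)
```

With 7 bits, an input 40 dB below clipping (x = 0.01) is reproduced with an error far smaller than the uniform step size of 2/128, which is the behavior the comparison with uniform PCM relies on.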


C.  Adaptive  Quantization 

Instantaneous  companding  is  one  solution  to  the  problem  of  encoding  signals  that  vary  in 
amplitude.  This  method  is  a  compromise,  sacrificing  SNR  at  high  input  levels  by  spacing  the 
quantization  levels  in  a  logarithmic  manner  to  get  a  higher  SNR  at  lower  input  levels.  Speech, 
music, and most natural sounds vary slowly in amplitude relative to the sampling rate. Adaptive
quantization takes advantage of the slow variation by changing the spacing of the quantization
levels as a function of the power in the input averaged over a short period of time. This modification
is syllabic companding, implying adaptation over intervals comparable to syllable lengths
in speech. In practice, the variation may be quite a bit quicker than the rate of speech syllables.
Adaptive quantization maintains the SNR at a lower bit rate than instantaneous companding because
information specifying the signal power is not coded into each sample. Rather, it is spread
over the syllabic time interval related to the rate of variation of the input power.

1. Syllabic Companding

As  can  be  done  in  instantaneous  companding,  the  time-varying  adaptation  is  often  realized 
as  preprocessing  and  postprocessing  to  a  uniform  quantization  PCM,  referred  to  as  adaptive 
PCM  (APCM).  This  scheme  is  depicted  in  Fig.II-6.  The  gain  is  varied  so  as  to  keep  constant 
the  level  of  the  input  to  the  quantizer. 
Fig.II-6. Feedforward APCM with time-varying gain.


The  rate  of  adaptation,  the  rate  at  which  the  gain  in  Fig.II-6  is  allowed  to  change,  is  an 
important  variable.  By  adapting  very  quickly,  the  RMS  error  will  be  small  because  the  input 
is  always  utilizing  the  full  range  of  the  PCM  quantizer.  The  amount  of  information  necessary 
to  encode  the  gain  variations,  though,  has  increased  proportionally  to  the  frequency  bandwidth 
of  the  adaptation.  By  adapting  the  gain  more  slowly,  the  bit  rate  for  the  encoding  of  the  gain 
decreases.  These  bits  may  be  shifted  to  the  quantizer,  compensating  for  not  using  the  entire 
quantization range during periods that the input varies too rapidly. If the adaptation is too slow,
the problem encountered with no adaptation returns. In summary, it is desired to vary the gain as
slowly  as  possible  without  allowing  an  audible  lowering  of  the  SNR  and  without  overloading  the 
quantizer. 

Overload can be eliminated by a block adaptation scheme, a method for implementing a
noncausal adaptation gain function. By filling a buffer with a block of samples, the value of the
prequantization gain can be determined as a function of the samples in the buffer. Then, the
gain can adapt for an attack transient or other quick increase in signal energy to avoid any overload.
In this scheme, however, the error magnitude is determined by the maximum magnitude
sample value in a block. During rapid and abrupt changes in input signal level, the error may
be relatively large compared to the low energy portion of the input signal block, especially when
the transient occurs near the beginning or end of the block. Block lengths are usually chosen
in the range of 5 to 50 ms as a compromise. The constraints that are placed on the block length
are discussed and quantified further in Sections III-C and IV-D in terms of the psychophysics of
the auditory system.
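The block adaptation scheme can be illustrated with a short sketch. It assumes the gain is derived from the peak magnitude in each block and that the peak is sent unquantized as side information; both choices, and all names, are illustrative assumptions rather than the report's design.

```python
import math


def encode_block(block, n_bits):
    """Return (peak, codes): the side information and one code per sample."""
    levels = 2 ** n_bits
    # Use a nominal peak for an all-zero block to avoid a zero step size.
    peak = max(abs(x) for x in block) or 1.0
    step = 2.0 * peak / levels
    codes = [min(max(math.floor((x + peak) / step), 0), levels - 1) for x in block]
    return peak, codes


def decode_block(peak, codes, n_bits):
    """Rebuild the samples from the block peak and the interval codes."""
    levels = 2 ** n_bits
    step = 2.0 * peak / levels
    return [-peak + (c + 0.5) * step for c in codes]


def apcm(signal, n_bits, block_len):
    """Encode and decode a signal block by block; returns the reconstruction."""
    out = []
    for start in range(0, len(signal), block_len):
        block = signal[start:start + block_len]
        peak, codes = encode_block(block, n_bits)
        out.extend(decode_block(peak, codes, n_bits))
    return out
```

Since the step size follows the block peak, no sample overloads the quantizer. If a block holds both a quiet passage and a loud transient, however, the quiet samples inherit the large step, which is the block-length compromise described above.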




2.  Feedback  Adaptation 


A method for eliminating the need to encode the adaptation gain information is to make the
gain be a function of the previous encoded output. Since the encoder uses the output in its processing,
it is called feedback-adaptation quantization. As this information is already available
to the decoder, the decoder can derive the gain with no additional information. A block diagram
of a feedback-adaptation system is shown in Fig.II-7.


Fig.II-7. Feedback APCM with time-varying gain.


The  gain  can  only  be  a  function  of  the  previous  outputs  because  the  present  output  is  not 
available  until  after  the  gain  has  been  set  and  the  sample  quantized.  If  the  gain  were  permitted 
to  change  again  based  on  the  present  output;  i.e.,  as  an  iterative  procedure,  the  information  of 
the  iteration  would  not  be  known  by  the  decoder. 

Signals in nature such as speech and music from instruments may have sharp attack transients
where the signal energy greatly increases in less than 1 ms. In general, though, the
decay of these signals is much slower. In a feedback-adaptation quantizer, large errors can be
present during overload conditions. A sudden 10-dB jump in input level can produce an error
from overload which is greater than the signal. It is, therefore, important to adapt very
quickly by lowering the adaptation gain when the input level increases. In contrast, failure
to adapt quickly to decreases in signal energy results only in a temporary reduction in the SNR
by the amount of the decrease. Also, for signals with predominantly low-frequency energy,
peaks in the amplitude of steady-state periods may occur as infrequently as every 20 ms. If the gain
is increased on a time scale faster than this, it must also be decreased every amplitude-peak
sample. This may produce an audible pulsing of the quantization error that is more annoying
than a steady-state noise would be.
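A feedback-adaptation quantizer with fast attack and slow decay can be sketched as follows. The adaptation rule here (double the step when the previous code hit an outermost level, otherwise shrink it by 2 percent) is an illustrative assumption in the spirit of the discussion, not the report's rule; the step size plays the role of the reciprocal of the adaptation gain, and all names and constants are hypothetical.

```python
import math


def adapt(step, code, n_bits, attack=2.0, decay=0.98, min_step=1e-4, max_step=1.0):
    """Next step size from the previous code: fast attack, slow decay (assumed rule)."""
    levels = 2 ** n_bits
    hit_outer_level = code == 0 or code == levels - 1
    new_step = step * (attack if hit_outer_level else decay)
    return min(max(new_step, min_step), max_step)


def feedback_encode(signal, n_bits, step0=0.01):
    """Quantize each sample with a step derived only from earlier codes."""
    levels = 2 ** n_bits
    step = step0
    codes = []
    for x in signal:
        code = min(max(math.floor(x / step) + levels // 2, 0), levels - 1)
        codes.append(code)
        step = adapt(step, code, n_bits)
    return codes


def feedback_decode(codes, n_bits, step0=0.01):
    """Track the same step sequence from the codes alone; no side information."""
    levels = 2 ** n_bits
    step = step0
    out = []
    for code in codes:
        out.append((code - levels // 2 + 0.5) * step)
        step = adapt(step, code, n_bits)
    return out
```

Feeding this sketch a quiet level followed by a sudden loud level shows the behavior described above: the first loud samples overload badly, the step grows within a few samples, and the error then settles to at most half of the current step.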


D.  Predictive  Quantization 

Predictive quantization takes advantage of the correlation that may exist between input samples.
A prediction of the sample value is made and the prediction residue, the difference of the
actual and the predicted signal, is quantized. If there is correlation between input samples, the
prediction residue will be a signal with less power than the input. The quantizer can be adjusted
for the smaller difference signal and will produce a smaller quantization error. A block diagram
of a differential quantization system is shown in Fig.II-8. From the decoder we see that the
output of the system with no channel errors equals the quantized difference signal plus the predicted
signal as shown in Eq.(II-4):


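$$
\begin{aligned}
\hat{x} &= \tilde{x} + \hat{d} \\
        &= \tilde{x} + (d + e) \\
        &= \tilde{x} + [(x - \tilde{x}) + e] \\
        &= x + e . \qquad \text{(II-4)}
\end{aligned}
$$

Here x is the input, x̃ the predicted signal, d = x - x̃ the difference signal, d̂ = d + e the
quantized difference signal, e the quantization error, and x̂ the output.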


Thus, the system error, the difference between the input and the output, is exactly the error in
quantizing the difference signal. Since the signal that is quantized is the prediction residue, the
SNR of the output exceeds the expected SNR due to the quantizer by the amount of the prediction
gain, the ratio of the input to the residual energies [10], i.e.,

$$\mathrm{SNR} = \frac{P_x}{P_d} \cdot \frac{P_d}{P_e} \qquad \text{(II-5)}$$

where

P_x = Input signal power
P_d = Difference signal power
P_e = Error signal power

Systems  of  the  general  form  of  Fig.II-8  are  referred  to  as  differential  PCM  (DPCM)  systems. 
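The result of Eq.(II-4) can be demonstrated with a small DPCM sketch. It assumes a fixed first-order predictor with coefficient a = 0.9 and an unbounded uniform quantizer; both are illustrative choices, and the function names are hypothetical.

```python
def quantize(d, step):
    """Uniform mid-tread quantizer (unbounded, so no overload in this sketch)."""
    return step * round(d / step)


def dpcm_encode(signal, step, a=0.9):
    """Return the quantized difference signal; the predictor uses past outputs."""
    prediction = 0.0
    quantized_differences = []
    for x in signal:
        d_hat = quantize(x - prediction, step)
        quantized_differences.append(d_hat)
        # Reconstruct exactly as the decoder will, then predict the next sample.
        x_hat = prediction + d_hat
        prediction = a * x_hat
    return quantized_differences


def dpcm_decode(quantized_differences, a=0.9):
    """Output equals the prediction plus the quantized difference."""
    prediction = 0.0
    out = []
    for d_hat in quantized_differences:
        x_hat = prediction + d_hat
        out.append(x_hat)
        prediction = a * x_hat
    return out
```

Because the encoder forms its prediction from the reconstructed output rather than from the input, the output error equals the quantization error of the difference signal and does not accumulate. For a slowly varying input, the difference signal also carries much less power than the input, which is the prediction gain of Eq.(II-5).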


1.  Differential  Pulse  Code  Modulation 

For encoding speech, the improvement in SNR by using a DPCM system rather than PCM
is about 6 to 8 dB with 11 dB possible for an optimized system [11]. Unfortunately, there has been
little work with predictive quantization systems for music encoding so that it is difficult to quantify
the SNR advantage with music input.
Fig.II-8. Predictive quantization system.


It  is  possible  to  combine  previously  described  schemes  with  predictive  quantization  to 
achieve  a  combination  of  the  improvements  of  each.  By  using  an  instantaneous  compander  such 
as μ-law quantization in the DPCM system, the advantages of the logarithmic distribution of
quantization  levels  with  an  additional  6-dB  improvement  in  SNR  are  achieved. 

In  adaptive  differential  PCM  (ADPCM),  the  quantizer  and/or  predictor  are  permitted  to 
adapt  to  the  statistics  of  the  signal.  As  each  variable  in  the  DPCM  block  diagram  scales  with 
input  level,  the  dynamic  range  can  be  increased  by  varying  the  step  size  as  a  function  of  the 
energy in the difference signal as in APCM. By using an adaptive-quantization ADPCM system
to encode speech, there is approximately a 1.5-bit (9-dB) improvement in SNR over μ-law PCM.
Subjective quality as measured by listener preference shows a 2.5-bit improvement; e.g., 4-bit
ADPCM is rated between 6- and 7-bit μ-law PCM in quality.

When the predictor is permitted to adapt also, the predictor gain in Eq. (II-5) can be
increased further. Using an optimum 12th-order predictor, prediction gains of 13 dB for
voiced speech and 6 dB for unvoiced speech are typical. If the input speech is pre-emphasized,
the sample-to-sample correlation is decreased for voiced speech sounds in exchange for the
shaping of the noise spectrum by the de-emphasis filter at the processor output. With
pre-emphasis, 8-dB prediction gains for both voiced and unvoiced segments are typical.12 As a
summarizing example, ADPCM with a 4th-order adaptive predictor and 1-bit adaptive quantizer
can produce intelligible speech at 16 kbps comparable to 5-bit log PCM at 8-kHz sampling, a
60-percent savings.10
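
The 60-percent figure follows directly from the two bit rates. The short sketch below repeats that arithmetic; the function names are illustrative only and do not come from the report.

```python
def bit_rate_kbps(bits_per_sample: float, sampling_rate_khz: float) -> float:
    """Bit rate of a fixed-rate, sample-by-sample coder in kbps."""
    return bits_per_sample * sampling_rate_khz


def savings_percent(reference_kbps: float, coder_kbps: float) -> float:
    """Percentage of the reference bit rate that the coder saves."""
    return 100.0 * (1.0 - coder_kbps / reference_kbps)


log_pcm_kbps = bit_rate_kbps(5, 8.0)   # 5-bit log PCM at 8-kHz sampling
adpcm_kbps = 16.0                      # ADPCM rate quoted in the text
saving = savings_percent(log_pcm_kbps, adpcm_kbps)
print(log_pcm_kbps, round(saving, 1))
```

The reference coder runs at 40 kbps, so the 16-kbps ADPCM system uses 40 percent of that rate.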

2.  Delta  Modulation 

A simple DPCM system is delta modulation (DM), where a 1-bit quantizer and a fixed
1st-order predictor are used. By sampling the input at a rate much higher than the Nyquist
rate, the input is assured to change slowly from sample to sample. The step size of the 1-bit
quantizer can be set as a compromise to a level where the quantization error is small, but the
differential signal is never much larger than the quantization step size. Bit rates over
200 kbps (i.e., sampling rates over 200 kHz with the 1-bit quantizer) are necessary for
high-quality speech.13
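
The operation of a delta modulator can be illustrated with a minimal sketch. It assumes the simplest form of the scheme: a fixed step size and a predictor that is just the previous reconstructed value. The test signal, sampling rate, and step size are hypothetical values chosen so that the signal never changes by more than one step per sample.

```python
import math


def dm_encode(samples, step):
    """Linear delta modulation: 1-bit quantizer, fixed 1st-order predictor."""
    bits, estimate = [], 0.0
    for x in samples:
        bit = 1 if x >= estimate else 0        # sign of the difference signal
        estimate += step if bit else -step     # prediction = previous reconstruction
        bits.append(bit)
    return bits


def dm_decode(bits, step):
    """Rebuild the staircase approximation by accumulating +/- step."""
    out, estimate = [], 0.0
    for bit in bits:
        estimate += step if bit else -step
        out.append(estimate)
    return out


# Hypothetical test signal: a 100-Hz sine sampled at 64 kHz (heavily oversampled).
fs, f0, step = 64000, 100.0, 0.02
x = [math.sin(2 * math.pi * f0 * n / fs) for n in range(1280)]
y = dm_decode(dm_encode(x, step), step)
max_err = max(abs(a - b) for a, b in zip(x, y))
print(max_err <= step)
```

Because the sine changes by less than one step between samples, the reconstruction stays within one step of the input. A faster-changing input would exceed the step size and the staircase would lag behind it, which is the compromise described above.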

By adapting the quantization step size to the signal, the bit rate of DM can be reduced
dramatically. Adaptive delta modulation (ADM) uses a feedback adaptation scheme based on the
past quantized outputs to permit tracking of the input signal at lower rates. Speech quality equal
to 4-kHz bandwidth, 7-bit log PCM can be produced at the same bit rate using ADM, 56 kbps.
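
The feedback adaptation can be sketched as follows. The text does not say which adaptation rule is used, so this example assumes a common one-bit-memory rule: the step grows when two successive output bits agree and shrinks when they differ. The multiplier and step limits are hypothetical. Since the step depends only on bits already transmitted, the decoder can repeat the adaptation without side information.

```python
import math


def _adapt(step, bit, prev_bit, factor, step_min, step_max):
    """Grow the step on repeated bits, shrink it on alternating bits (assumed rule)."""
    if prev_bit is None:
        return step
    step = step * factor if bit == prev_bit else step / factor
    return min(max(step, step_min), step_max)


def adm_encode(samples, factor=1.5, step_min=0.005, step_max=0.5):
    """Return (bits, reconstruction) for adaptive delta modulation."""
    bits, recon = [], []
    estimate, step, prev_bit = 0.0, step_min, None
    for x in samples:
        bit = 1 if x >= estimate else 0
        step = _adapt(step, bit, prev_bit, factor, step_min, step_max)
        estimate += step if bit else -step
        bits.append(bit)
        recon.append(estimate)
        prev_bit = bit
    return bits, recon


def adm_decode(bits, factor=1.5, step_min=0.005, step_max=0.5):
    """Repeat the encoder's step adaptation from the received bits alone."""
    out, estimate, step, prev_bit = [], 0.0, step_min, None
    for bit in bits:
        step = _adapt(step, bit, prev_bit, factor, step_min, step_max)
        estimate += step if bit else -step
        out.append(estimate)
        prev_bit = bit
    return out


# Hypothetical test signal: a 200-Hz sine sampled at 16 kHz.
x = [math.sin(2 * math.pi * 200 * n / 16000) for n in range(400)]
bits, recon = adm_encode(x)
decoder_matches = adm_decode(bits) == recon
print(decoder_matches)
```

The sketch only demonstrates that encoder and decoder stay in step; it makes no claim about the speech quality obtained with these particular parameter values.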

E.  Time-Frequency  Domain  Quantization 

The  object  of  all  schemes  for  the  encoding  of  speech  and  audio  is  to  approximate  the  signal 
with the smallest error for a given information rate. For speech, models of the speech
production process permit tailoring of the system and adaptation of the parameters for increased SNR.
For  instrumental  music  and  other  natural  sounds,  signal  statistics  are  available  but  are  not  as 
predictable  and,  hence,  do  not  permit  as  much  improvement  as  with  speech  inputs. 

The  level  of  the  error  of  an  encoding  system  is  usually  measured  in  terms  of  SNR  with 
various inputs. Most systems are optimized to achieve the highest SNR over the range of
allowable input signals. After the system is optimized, it is tested for audibility of the encoding
error,  quality,  and  intelligibility.  By  initially  designing  the  encoder  to  minimize  the  audibility 
of  the  error,  further  improvements  may  be  made. 

The systems just described are waveform coders, adapting and predicting on a time-sample
to time-sample basis for the full-bandwidth time waveform. The encoding error is a noise
process with time-varying variance and a flat power spectral density that may be postfiltered by
de-emphasis. The audibility of the error, however, is dependent on the dynamic temporal and
spectral relation of the signal and error. Time-frequency domain quantization systems attempt
to transform the signal into a domain where quantization is better matched to audition. Although
the psychophysics of the auditory system are detailed in Chapter III, a brief review of existing
time-frequency domain encoders is given in this Section.

1. Sub-Band Coding

The sub-band coder divides the signal into several bandpass frequency channels, typically
4 to 8 in number, each to be quantized separately. Use of APCM quantization for each band
maintains a constant SNR over a large range of input levels. A block diagram of the sub-band
coding system is shown in Fig. II-9.

Fig. II-9. Sub-band coder.

By coding each sub-band independently, the quantization error from a band will be
constrained to be in that band. When a signal with predominantly low-frequency energy is coded by
the waveform coders of the previous sections, the quantization error is white noise. The
high-frequency components of the noise will then be audible unless the SNR is much greater than
60 dB. The quantization error of the sub-band coder is restricted to frequencies close to the
signal-frequency components. High-frequency channels with low signal energy will contribute
only small errors with energy proportional to the signal energy in that band. An SNR of 40 dB
may be sufficient to render the noise inaudible.

Another  advantage  of  the  sub-band  system  is  that  the  bit  rate  and,  hence,  the  SNR  of  each 
band  may  be  chosen  independently  of  the  other  bands.  In  speech  signals,  significant  high- 
frequency  energy  is  present  only  for  unvoiced  speech  sounds.  Since  this  is  a  noise-like  signal, 
greater  quantization  error  is  tolerable,  especially  if  it  is  shaped  to  the  speech  spectrum.  Thus, 
fewer  bits  can  be  used  to  encode  the  upper-frequency  channels. 

For speech coding at 16 kbps, sub-band coding has a slightly higher SNR (11.2 dB) than
ADPCM (10.9 dB). Subjectively, however, it is comparable to 22-kbps ADPCM.15 It is unclear
how sub-band coding compares with the optimized adaptive quantizer and adaptive predictor
ADPCM at 16 kbps that has an SNR of 17 dB.5 Note, however, that this ADPCM receives
significant predictor gain from optimum prediction of the speech waveform. If signals other than
speech were used as input, the predictor gain would decrease greatly while the sub-band coder
would not be degraded significantly.

2.  Transform  Coding 

Transform coding is another technique to match the quantization to the short-time Fourier
analysis that is performed by the auditory system. Whereas the sub-band coding system
quantizes time samples from a window in frequency (the bandpass filter output), transform coding
quantizes frequency samples from a window in time. Although the effects are similar, the
implementations differ due to the characteristics of the specific frequency transformation used in
the transform coding and differing adaptation strategies due to the statistics of the signals to be
quantized.

Adaptive transform coding (ATC), transform coding with an adaptive quantization-bit
distribution strategy, yields a 3- to 6-dB SNR advantage over ADPCM when optimized for speech.16
Perceptually, however, it is quite different. The degradations introduced by the encoding are
not perceived as noise. They are manifested as changes in the quality of the signal. ATC can
produce toll-quality (telephone-quality) speech at a bit rate of 16 kbps for 3.2-kHz bandwidth
speech.17

III. PERCEPTUAL REQUIREMENTS OF A DIGITAL ENCODER


A.  Introduction 

The performance of any audio system can only be judged by how it sounds. On all but the
highest quality sound reproduction systems, music through 16-bit PCM encoding systems will
sound identical to the original signal. Experiments by the British Broadcasting Corporation
(BBC) and German Post Office show that 13-bit PCM encoding yields acceptable quality for many
applications in radio broadcast.7,18,19,20 If, however, the digital signal is to be processed
further, the sum of the errors due to 13-bit encoding and subsequent processing may sound degraded
with respect to the original and 16-bit PCM signals.

To  decide  when  errors  will  be  audible  and,  therefore,  what  is  important  in  the  design  of  an 
encoding  system,  it  is  necessary  to  have  an  understanding  of  the  capabilities  and  limitations  of 
the  auditory  system.  Although  there  are  no  complete  models  of  the  auditory  system,  there  is  a 
large  body  of  literature  detailing  experiments,  relating  various  psychoacoustic  phenomena,  and 
modeling certain aspects of the auditory system. This literature, along with some simple
experiments, can be used to specify the perceptual requirements of a digital encoder for audio
signals. 

In this Chapter, the concepts of masking and critical bands are reviewed briefly and
quantified with reference to the results of experiments from the literature. It will be hypothesized
that  by  tailoring  the  spectral  and  temporal  aspects  of  the  error  signal  from  a  digital  encoder, 
it  is  possible  to  specify  conditions  under  which  the  error  is  rendered  inaudible  when  presented 
along  with  the  audio  signal.  Thus,  the  output  of  the  encoder  will  sound  identical  to  the  input. 

B.  Masking  and  Critical  Bands 

By its very nature, the approximation of a continuous audio waveform by a discrete digital
bit stream will have an error. Depending on the encoding system, this error or noise signal
may or may not be correlated with the audio signal. The quality of the system is dependent on
the audibility of the encoding noise.

The approximate range of normal human hearing spans from 20 Hz to 20 kHz in frequency.
As seen in Fig. III-1, the threshold of audibility for pure tones varies from a minimum at 3.5 kHz
of -4-dB sound pressure level (SPL) relative to the standard pressure of 0.0002 dyne/sq cm,
the normal threshold at 1 kHz. The threshold rises to a maximum of 73-dB SPL at 20 Hz.
Levels above 140-dB SPL are usually painful, representing the practical upper amplitude limit
of hearing. To try to design for this dynamic range of 144 dB would require major advances in
the state of the art in electronic instrumentation. Even if this range is restricted to 100 dB,
a more practical limit with respect to most speech and audio signals, a linear PCM encoder
would require 16-bit coding. Several digital encoding systems, however, do use 16-bit PCM to
achieve large dynamic range and very low noise.
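
The link between word length and dynamic range can be checked with the usual rule of about 6 dB per bit. The sketch below assumes the simplest definition, the ratio of full scale to one quantization step; other definitions of SNR differ from it by a constant.

```python
import math


def pcm_dynamic_range_db(bits: int) -> float:
    """Ratio of full scale to one quantization step for linear PCM, in dB."""
    return 20.0 * math.log10(2 ** bits)


# Span from the minimum threshold (-4 dB SPL) to the pain level (140 dB SPL).
hearing_range_db = 140 - (-4)

for bits in (13, 16):
    print(bits, round(pcm_dynamic_range_db(bits), 1))
```

Under this definition 16 bits give a range of about 96 dB, close to the 100-dB practical limit quoted above and far short of the full 144-dB span of hearing.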

The thresholds shown in Fig. III-1 are for pure tones in isolation. When a tone is presented
along with other signals, the threshold may be raised in a manner dependent on the spectral and
temporal characteristics of the other signal. This is masking, the phenomenon of one audio
signal inhibiting the detection of another audio signal. Figure III-2 shows how a complex signal
composed of two tones is perceived. The graph is for a 1200-Hz, 80-dB SPL primary tone and
is plotted as a function of secondary-tone frequency and level. Under these conditions, the
masked threshold of the secondary tone (the minimum level where the secondary or target tone
is perceived along with the primary or masking tone) is higher than the unmasked threshold from
Fig. III-1. The amount that the threshold is raised is the amount of masking due to the masking
tone. Note that the amount of masking is greater for frequencies near the masking frequency and
that there is little masking at frequencies below the masking frequency.

Fig. III-1. Threshold of audibility for pure tones.

Fig. III-2. Perception of a two-tone signal.

By replacing the masking tone by a narrow band of noise, the masked threshold is no longer
dependent on the exact frequency relation of masker and target. Figure III-3 shows the masking
pattern of a 90-Hz narrowband noise signal presented at several sound pressure levels. The
notches due to "beat" phenomena in Fig. III-2 have been replaced by peaks in the masking
audiogram. The general shape of the curves indicates that the amount of masking in the vicinity of
the center frequency of the band of noise is approximately linearly related to the noise power,
i.e., a 10-dB increase in noise power results in a 10-dB increase in the sinusoid masked
threshold. For the experiment in Fig. III-3, the masked threshold of a 410-Hz tone centered in a 70-dB
SPL noise masker is the unmasked threshold (7-dB SPL) plus the amount of masking (53 dB)
for a total masked threshold of 60-dB SPL. Thus, the level of the tone must be within 10 dB
of the masker power to be audible at this frequency.
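
The numbers in this example can be restated as a short calculation. The sketch assumes that the dB-for-dB growth of masking holds exactly near the center of the noise band; the function names and the 80-dB trial level are illustrative only.

```python
def masked_threshold_db(unmasked_threshold_db: float, masking_db: float) -> float:
    """Masked threshold = unmasked threshold + amount of masking (all in dB SPL)."""
    return unmasked_threshold_db + masking_db


def scaled_threshold_db(ref_threshold_db: float, ref_masker_db: float,
                        masker_db: float) -> float:
    """Assumed linear rule: the threshold follows the masker level dB for dB."""
    return ref_threshold_db + (masker_db - ref_masker_db)


threshold_70 = masked_threshold_db(7.0, 53.0)   # 410-Hz tone in a 70-dB SPL masker
margin = 70.0 - threshold_70                    # how far the tone may sit below the masker
threshold_80 = scaled_threshold_db(threshold_70, 70.0, 80.0)
print(threshold_70, margin, threshold_80)
```

Under the linear rule the 10-dB margin between masker and masked threshold stays the same at any masker level.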

Fig. III-3. Masking audiogram of a narrow band of noise.


The frequency range of masking, as well as many other psychoacoustic phenomena such as
loudness perception and phase audibility, is related to the critical band, the bandwidth where
there is a sudden change in observed subjective responses. The critical band is often defined as
the range of frequencies of a noise signal that contribute to the masking of a pure tone centered
in frequency in the band of noise. It is measured by masking a tone by bandpass-filtered
white noise. The bandwidth of the masker is decreased until the masked threshold of the tone
starts to decrease. The bandwidth of the noise masker at that point is defined to be the critical
bandwidth at the frequency of the sinusoid. A graph of critical bandwidth as a function of
frequency, compared to third octaves, sixth octaves, and 5-percent articulation index bands, is
shown in Fig. III-4.

Fig. III-4. Bandwidth of critical bands and 5% articulation index bands.

It is shown in Fig. III-5 that the linear relationship of masker power and amount of masking
conjectured from the curves in Fig. III-3 is true and essentially independent of the frequency of
the tone.25,29 White noise has been filtered to a critical bandwidth at six frequencies. At
several levels at each frequency, the narrow band of noise is used to mask a sinusoid centered in
the noise. The amount of masking of the tone is plotted as a function of the level of the critical
band of noise relative to its threshold. At each frequency, an increase in masker level is
accompanied by an equal increment in dB of the amount of masking.

For use in a digital audio encoder, signals other than narrowband noise must be considered
as possible masking signals and signals other than sinusoids as masking targets. The error
from an encoding system will typically be a noise-like signal. A PCM encoder is a time-invariant
nonlinear system that yields an error that is a deterministic function of the input. As the number
of bits per sample is increased, the correlation of the signal and the error is reduced.
Perceptually, the error sounds the same as white noise when five or more bits per sample are used.

Thus, it is important to consider the masking of noise signals. Consistent with the masking
curves in Fig. III-3, there is very little masking of wideband white noise by narrowband signals
such as tones.1,31 Narrowband signals would, however, be expected to mask narrowband
noise.

Because no pertinent experiments were found in the literature, the following experiment was
performed. For each of 18 frequencies, narrowband noise was masked by a sinusoid at a listening
level of approximately 70-dB SPL at the center frequency of the band of noise. The noise was
obtained by passing wideband noise through a 4-pole Butterworth filter. The bandwidths of the
filters, shown in Table III-1, are approximately of critical bandwidth. The sinusoid and noise
signals were presented together monaurally over headphones. Each subject varied the amplitude
of the noise until it was at the minimum audible level. The results of three subjects were
averaged and are presented in Fig. III-6. The SNR necessary for the noise to be masked varies
with frequency, reaching a maximum of 28 dB at 1150 Hz. Thus, the frequency at which the
least amount of masking is present is 1150 Hz. The maximum variation of a subject from the
average was 4 dB. This indicates that tones are effective at masking narrowband noise, but
the amount of masking is less than the amount of masking for narrowband noise masking of tones.

Fig. III-5. Relation between the masking by white noise and the effective
level of the noise. The effective level is the power in the critical band
around the masked sinusoid.

TABLE III-1

FILTER BANDWIDTHS USED IN MASKING EXPERIMENT

Channel     Filter Center       Filter
Number      Frequency (Hz)      Bandwidth (Hz)

   0             240                130
   1             360                130
   2             480                130
   3             600                130
   4             720                130
   5             840                130
   6            1000                165
   7            1150                165
   8            1300                165
   9            1450                165
  10            1600                165
  11            1800                220
  12            2000                220
  13            2200                220
  14            2400                220
  15            2700                330
  16            3000                330
  17            3300                330

C.  Nonsimultaneous  Masking 

The  previous  experiments  all  use  signals  that  are  presented  simultaneously.  Judgments  of 
audibility were made when the signals were in the steady state and not during transients.
Masking, however, also occurs when the stimuli are not presented simultaneously. A masking signal
can  mask  sounds  occurring  before,  referred  to  as  backward  or  premasking,  or  after,  referred 
to  as  forward  or  postmasking.  Although  the  time  period  for  nonsimultaneous  masking  is  short, 
less  than  100  ms,  it  is  very  important  to  the  question  of  digital  encoding.  By  not  requiring  the 
system  to  adapt  instantaneously  to  transients  in  the  input  waveform,  a  large  saving  in  the  amount 
of  information  to  specify  the  signal  and,  hence,  a  lowering  of  the  bit  rate  can  be  obtained.  This 
principle  is  used  in  the  design  of  compressors,  limiters,  and  automatic  gain  controls  for  audio 
recording  and  broadcast  industries. 

Fig. III-6. Ratio of sinusoid masker level to critical bandwidth
noise level at masked threshold.

In Fig. III-7 the masking of short tone bursts by a single, critical-bandwidth, noise masker
is shown as a function of the timing of the tone burst in relation to the masker. The masker
is narrowband noise centered at 8.5 kHz, 1.8 kHz in bandwidth, and presented at 70-dB SPL for
a duration of 500 ms. The tone bursts, at frequencies of 6.5, 8.5, and 11 kHz, were 1 ms in
duration. The unmasked thresholds of the tone bursts are also shown for reference. For 8.5-kHz
tone bursts centered in frequency in the band of noise, the nonsimultaneous masked threshold is
within 20 dB of the simultaneous masked threshold when the tone burst occurs within 10 ms
before or 30 ms after the noise-masker burst.

If the duration of the tone bursts is increased slightly, the shape of the curves remains
unchanged but the curves are shifted downward; i.e., there is less masking. This shift is by
approximately the same factor as the increase in energy of the burst, in agreement with the
short-time temporal integration in the auditory system.13 Thus, increasing the duration from 1 to
10 ms lowers the masked threshold by about 10 dB; i.e., to 15 dB below the masker level.
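
The 10-dB figure is the energy ratio of the two bursts expressed in dB. The sketch below assumes that burst energy is proportional to duration at a fixed level and that the masked threshold follows burst energy exactly, which the text states only for short bursts. The 5-dB starting value is inferred from the quoted 15-dB result.

```python
import math


def threshold_shift_db(old_duration_ms: float, new_duration_ms: float) -> float:
    """Change in masked threshold when the burst duration (hence energy) changes."""
    return -10.0 * math.log10(new_duration_ms / old_duration_ms)


shift = threshold_shift_db(1.0, 10.0)      # 1-ms burst lengthened to 10 ms
threshold_1ms_re_masker = -5.0             # dB relative to masker level (inferred)
threshold_10ms_re_masker = threshold_1ms_re_masker + shift
print(round(shift, 1), round(threshold_10ms_re_masker, 1))
```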

The results of these experiments are summarized in Fig. III-8. The backward, forward,
and simultaneous masking curves have been combined to reflect the total transient masking
pattern of the tone burst by the critical-band noise masker.

D.  Summary 

The masking data indicate that a sinusoid raises the masked threshold of a narrowband noise
to within 28 dB of the sinusoid level. Other experiments imply that nonsinusoid maskers, having
a greater spectral spread, provide even more masking.

Fig. III-7. Nonsimultaneous masking of tone pulses by critical bandwidth noise
masker bursts: (a) forward masking, (b) backward masking.

Fig. III-8. Transient masking pattern of a single, critical bandwidth,
noise-masker impulse.


The signals that are to be encoded (speech, music, and other audio signals) are not
narrowband signals as are the tone and noise maskers of the previously described experiments. It is
possible to divide the wideband audio signal into several narrowband signals by filtering
techniques. It is hypothesized that if the perceptual requirements for the encoding error to be masked
are met for every narrowband frequency section of the output signal, then the error will still be
masked when all of the frequency bands are presented together. This hypothesis simply states
that masking is additive for masker-target pairs that are nonoverlapping in frequency.

To meet these requirements efficiently, a system should have an error that adapts to the
input signal. The noise spectrum, as measured by a short-time Fourier transform or through a
bank of bandpass filters, should be shaped so that the SNR in every frequency band of critical
bandwidth is adequate for masking. Temporally, the system should adapt quickly enough
that errors occurring shortly before or after transients will also be masked.
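
The spectral part of this requirement amounts to a per-band test. The sketch below assumes that signal and noise levels are already available for each critical band and applies the worst-case 28-dB figure uniformly to every band, which is more conservative than the measured frequency-dependent values. The band levels are hypothetical.

```python
REQUIRED_SNR_DB = 28.0   # worst-case value from the tone-masking-noise experiment


def unmasked_bands(signal_db, noise_db, required_snr_db=REQUIRED_SNR_DB):
    """Return the indices of bands whose SNR is too low for the noise to be masked."""
    return [i for i, (s, n) in enumerate(zip(signal_db, noise_db))
            if s - n < required_snr_db]


signal = [70.0, 62.0, 55.0, 40.0]            # hypothetical band levels, dB SPL
flat_noise = [30.0] * len(signal)            # white-noise-like error, same level in every band
shaped_noise = [s - 30.0 for s in signal]    # error held 30 dB below the signal in each band

print(unmasked_bands(signal, flat_noise))    # weak bands fail
print(unmasked_bands(signal, shaped_noise))  # no band fails
```

With the flat error the two weakest bands fall below the required SNR, while the shaped error passes in every band, which is the case for letting the error follow the signal spectrum.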

IV. DESIGN OF THE DIGITAL ENCODER


A.  Introduction 

By shaping dynamically the spectrum of the encoder error, it is possible to render the
error inaudible by the masking action of the input signal. The masking curves in Chapter III
show that when the noise is restricted to a narrow band, a sinusoid will mask the noise if the
SNR is sufficient; i.e., greater than that shown in Fig. III-6, or a maximum of 28 dB for any
frequency band. For nonsteady-state signals, temporal restrictions on the error signal are
implied by the nonsimultaneous masking results. These experiments are used to define a
time-frequency domain similar to the short-time spectral analysis domain used for analysis by the
auditory system. By transforming signals into this domain, the quantization process can be
better matched to audition.

Two basic techniques are currently used for the signal transformation, as was noted in
Section II-E. The subband coder uses a bank of bandpass filters to separate the signal into
several independent frequency channels for quantization. The transform coder uses a discrete
cosine transform to form a short-time spectral analysis of a windowed time segment. Although
either method performs the desired transformation, bandpass filtering was chosen since the
temporal and spectral parameters in the implementation are related closely to auditory
performance parameters. In this Chapter, the block diagram of the system is presented and the
implementation of the blocks described. The relation of the parameters to the perceptual
requirements analyzed in Chapter III is discussed. Details of the signal processing aspects of the
implementation are discussed in Appendix I.

B.  Block  Diagram  of  the  Encoding  System 

The structure of the encoding system is shown in Fig. IV-1. The audio signal is filtered
into a set of 24 contiguous frequency bands that cover the audible frequency range. For input
signals that do not require system response to 15 kHz, fewer bands are used. For speech
inputs, 17 filter channels cover the range to 4.1 kHz. By quantizing each channel independently,
the quantization error can be restricted to the frequency band of that channel. Quantization
accuracy and temporal adaptation characteristics may also be chosen according to the
discrimination ability of the auditory system in each frequency range. To ensure that the quantization
error is masked by the signal, each channel is approximately of critical bandwidth or smaller.


Fig.  IV-1.  Digital  encoding  system. 

After the signal is filtered into narrow frequency bands, each channel may be resampled
at a lower sampling rate commensurate with the channel bandwidth; i.e., at the Nyquist rate of
the channel. If the channels are contiguous and nonoverlapping, the sum of the sampling rates
of the channels would equal the original sampling rate. Unfortunately, an ideal bandpass filter
is not a causal function and is, therefore, not realizable. Implementation considerations of
the filter bank are discussed in Section IV-C.

One implementation solution that achieves the desired spectral shaping is to eliminate the
resampling operations before and after quantization. Sampling would then be at a rate much
higher than the Nyquist rate, as occurs in a delta modulation system. Assuming that the
quantization error can be approximated by a wideband white-noise process, filtering to the signal
bandwidth after quantization will remove out-of-band noise. The SNR will be increased by the
ratio of the noise bandwidth to the channel bandwidth. The total bit rate for this encoding will
also increase by the same factor as there are more samples per second to be quantized. For
each doubling of the sampling rate over the Nyquist rate and, hence, doubling of the bit rate
(assuming no change in the number of quantization bits per sample), there is a 3-dB increase
in the SNR resulting from filtering out half of the noise energy. Doubling of the bit rate by
maintaining the sampling at the Nyquist rate and increasing the accuracy of the quantization by
doubling the number of bits per sample, however, would result in the squaring of the SNR, i.e.,
a doubling of the SNR expressed in dB. In general, therefore, unless significant predictor gains can
be achieved with the over-sampled signal by using a DPCM quantization scheme, sampling a
channel at higher than its Nyquist rate does not represent an efficient allocation of the bits
available for encoding.
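
The two ways of spending a doubled bit rate can be compared numerically. The sketch below assumes white quantization noise and the usual rule of about 6 dB per bit; the 4-bit starting point is a hypothetical example.

```python
import math


def snr_gain_oversampling_db(rate_factor: float) -> float:
    """SNR gain from sampling at rate_factor times the Nyquist rate and then
    filtering out the out-of-band part of a white quantization noise."""
    return 10.0 * math.log10(rate_factor)


def snr_gain_extra_bits_db(bits_before: int, bits_after: int) -> float:
    """SNR gain from finer quantization, at 20*log10(2) dB (about 6 dB) per bit."""
    return 20.0 * math.log10(2.0) * (bits_after - bits_before)


# Both options double the bit rate of a hypothetical 4-bit channel.
gain_rate = snr_gain_oversampling_db(2.0)    # twice the sampling rate, still 4 bits
gain_bits = snr_gain_extra_bits_db(4, 8)     # same sampling rate, 8 bits
print(round(gain_rate, 1), round(gain_bits, 1))
```

Doubling the sampling rate buys about 3 dB, whereas the same number of additional bits per second spent on word length buys about 24 dB in this example.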

For  the  quantization  error  to  be  masked  by  the  input  signal,  it  is  necessary  to  shape  the 
spectrum  of  the  error  dynamically  as  a  function  of  the  short-time  spectrum  of  the  signal.  This 
implies  that  the  quantization  algorithm  must  maintain  relative  accuracy  over  a  large  dynamic 
range.  The  choice  between  instantaneous  companding  and  adaptive  quantization  algorithms  is 
discussed  in  Section  IV-D. 

After quantization, each channel is resampled up to the original sampling rate. This
resampling process causes replicas of the narrowband signal spectrum in other parts of the frequency
range of the system, parts covered by other channels. Filtering each channel to its original
frequency band before summing with the other channels eliminates these images. If the
signal-processing requirements have been followed, the encoding error present in the reconstructed
output will be due only to the quantization and not to artifacts of the implementation.

C.  Design  of  the  Filter  Bank 

Several  issues  are  important  in  the  design  of  the  filter  bank  from  the  viewpoint  both  of 
theory  and  implementation.  The  bandwidth  of  each  channel  in  the  filter  bank  should  be  a  critical 
bandwidth  or  smaller  so  that  the  encoding  error  of  the  channel  is  masked  by  the  channel  signal. 
For  the  greatest  reduction  in  the  bit  rate,  it  is  desired  to  have  the  lowest  possible  resampling 
rate  for  each  channel.  All  realizable  filters  have  finite  transition  bands  and  finite  attenuation 
in  the  stop  bands.  Because  of  the  aliasing  that  occurs  for  under-sampling,  the  sample  rate  for 
each  channel  must  be  at  least  twice  the  bandwidth  to  the  stop-band  edges.  The  resultant  aliasing 
will  be  due  only  to  signals  at  frequencies  leaking  through  the  filter  stop  band.  By  using  filters 
with  enough  stop-band  attenuation,  this  aliasing  will  be  inaudible.  Since  it  is  necessary  for  the 
sum of the channel signals to sound the same as the original input signal, the passbands must
be contiguous. Thus, the narrower the transition bands of each filter, the less over-sampling
will be necessary to avoid audible aliasing. Although high-order recursive (IIR) digital
filters (such as 16th-order elliptic filters) have good specifications in terms of transition
bandwidth for the channel bandwidth in question, they have rapid phase fluctuations at the band edges.
The resultant phase distortion and ripple in the magnitude of the frequency response are
often audible and, therefore, not acceptable. The use of nonrecursive (FIR) filters having
linear phase can eliminate this problem. Nonrecursive filters, however, have only zeros in
their z-plane response and require a much higher order than recursive filters for similar-width
transition bands. A windowed bandpass-filter design for a signal with a sampling rate of
30 kHz having a 50-Hz transition band would require an FIR filter with an approximate
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Several  schemes  for  implementing  the  filter  bank  have  recently  been  demonstrated. 

A filter bank consisting of equally spaced, frequency-translated replicas of a prototype
lowpass filter leads to a simple condition for the sum of the channels to equal the input. Each
of the bandpass filters has an impulse response as shown in Eq. (IV-1), where h(n) is the
impulse response of a lowpass filter with bilateral bandwidth equal to the bandwidth of the
bandpass filter:

$$h_k(n) = h(n) \cos[\omega_k n] \tag{IV-1}$$


An  implementation  of  the  kth  filter  is  shown  in  Fig.  IV-2.  It  is  now  necessary  to  design  only 
one  lowpass  filter  to  produce  the  bank  of  bandpass  filters.  Choosing  the  center  frequencies  of 
the  N  channels  so  that  the  filters  are  equally  spaced  and  cover  the  frequency  band,  the  impulse 
response  of  the  system  is  Eq.  (IV-2): 


$$
\begin{aligned}
g(n) &= \sum_{k=1}^{(N-1)/2} h(n) \cos[2\pi k n/N] \\
     &= h(n) \sum_{k=1}^{(N-1)/2} \cos[2\pi k n/N] \\
     &= h(n)\, d(n).
\end{aligned} \tag{IV-2}
$$

Fig. IV-2. Implementation of one channel of a filterbank.

Fig. IV-3. Integer-band sampling technique for subband coding (Crochiere et al., 1976).
[Amplitude spectra: bandpass signal occupying mf to (m + 1)f, shown for m = 2; sampling at 2f;
sampled signal; desampled signal.]

Fig. IV-4. Two-channel quadrature mirror-filter decomposition.

Since  d(n),  the  sum  of  equally  frequency-spaced  cosines  in  Eq.  (IV-2),  is  a  pulse  train  with  a 
spacing  of  n  =  N,  the  composite  system  response  is  just  a  sampling  of  the  prototype  lowpass 
filter.  The  lowpass  filter  is  designed  such  that  when  sampled,  it  is  an  impulse,  the  desired 
impulse response of the system. An efficient implementation of this system has been produced
using the fast Fourier transform (FFT) algorithm. It is difficult, however, to adapt this system
implementation to the case of unequal bandwidth channels.
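The pulse-train property of d(n) can be checked numerically. The sketch below is illustrative only: it assumes a full set of N equally spaced frequencies, k = 0, ..., N - 1, so that the sum is an exact pulse train, and the channel count `NUM_CHANNELS` is a hypothetical value not taken from the text.

```python
import math


def cosine_sum(n, num_channels):
    """Return d(n), the sum of num_channels equally spaced cosines at sample n."""
    return sum(math.cos(2.0 * math.pi * k * n / num_channels)
               for k in range(num_channels))


NUM_CHANNELS = 8  # hypothetical channel count, chosen only for illustration
d = [cosine_sum(n, NUM_CHANNELS) for n in range(3 * NUM_CHANNELS)]

# Sample indices at which d(n) is not numerically zero.
pulse_positions = [n for n, value in enumerate(d) if abs(value) > 1e-9]
print(pulse_positions)
```

Because d(n) is nonzero only at multiples of N, the composite response h(n) d(n) retains only every Nth sample of the prototype, which is why the prototype is designed to be an impulse when sampled at that spacing.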

Another system that implements bandpass filters uses integer-band sampling.1 A band-limited
signal may be sampled at the Nyquist rate of twice its bandwidth. This is often implemented
by modulating the signal so that the band of interest is centered at zero frequency. A
lowpass filter is then used and the signal can be down-sampled without aliasing. The integer-band
sampling scheme does not modulate the signal bands to zero frequency. Bandpass filters
are used and each band is sampled at its Nyquist rate. When the band is not a lowpass signal
channel, care must be taken so that the sampling does not cause aliasing of the signal back into
its frequency band even though the sampling rate is sufficient. If the band limits are chosen
such that the low-frequency cutoff is an integer multiple of the channel bandwidth, the signal
may be down-sampled without modulation to baseband with no resultant aliasing. This restriction
is a fundamental limitation of the integer-band sampling scheme. The system implementation
is shown in Fig. IV-3. If the bandpass filter is a nonrecursive filter, efficiency can
be gained by computing only those values that will be sampled. For a 100-Hz bandpass filter
to be down-sampled from 10 kHz to 200 Hz, this implies calculating one sample of each fifty.
If recursive filters are used, every sample must be computed since the filter bases the output
on past outputs as well as inputs. The main advantage of the integer-band sampling system is
simplicity and smaller computational loads because modulation to the zero frequency is not
necessary.
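The saving available with a nonrecursive filter can be sketched as follows. This is a minimal illustration, not the implementation used in the system: the moving-average filter `h` and the test signal `x` are hypothetical, and the function names are introduced here only for the example. The decimated version evaluates the convolution sum only at the output samples that survive down-sampling by fifty.

```python
import math


def fir_full(x, h):
    """Full FIR convolution: every output sample is computed."""
    y = [0.0] * (len(x) + len(h) - 1)
    for n, xn in enumerate(x):
        for k, hk in enumerate(h):
            y[n + k] += xn * hk
    return y


def fir_decimated(x, h, factor):
    """Compute only every factor-th output sample of the same convolution."""
    out = []
    for n in range(0, len(x) + len(h) - 1, factor):
        acc = 0.0
        for k, hk in enumerate(h):
            if 0 <= n - k < len(x):
                acc += hk * x[n - k]
        out.append(acc)
    return out


x = [math.sin(0.05 * n) for n in range(500)]  # hypothetical input signal
h = [1.0 / 20.0] * 20                         # hypothetical FIR filter (moving average)
y_decimated = fir_decimated(x, h, 50)
y_reference = fir_full(x, h)[::50]
```

The decimated routine performs roughly one fiftieth of the multiplications of the full routine. A recursive filter cannot be shortened this way because each output depends on the preceding outputs.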

A major disadvantage of these filter-bank systems is that the filters must be very sharp
so that the amount of over-sampling necessary to avoid aliasing is small. A recently developed
filtering technique, quadrature mirror filtering, permits the realization of a filter bank with
no aliasing and with total sampling rate equal to the sampling rate of the signal before filtering.

1.  Quadrature  Mirror  Filtering 

The  basic  quadrature  mirror-filter  technique  is  designed  to  divide  the  digital  frequency 
spectrum  into  two  equal  parts.  Each  band  is  sampled  at  half  the  rate  of  the  original  signal  for 
quantization,  coding,  and  transmission.  The  signals  are  then  resampled  back  up  to  the  original 
rate  and  filtered  again  before  summing.  Since  the  filters  are  not  ideal  filters,  there  is  some 
aliasing  after  the  down-sampling.  Because  of  the  special  relationship  of  the  filters,  when  the 
two  bands  are  summed,  the  components  due  to  the  aliasing  of  one  band  cancel  the  components 
due to aliasing of the other band. Thus, there is no aliasing in the resultant signal. The structure
of the basic filter block is shown in Fig. IV-4. A summary of the important aspects of
quadrature mirror filtering is given in this Chapter. A more detailed discussion can be found
in Appendix I.

When the four filters shown in Fig. IV-4 are defined by their relationship to the prototype
lowpass filter, h(n), as in Eq. (IV-3), the aliasing will vanish:

$$h_1(n) = h(n) \tag{IV-3a}$$

$$h_2(n) = (-1)^n h(n) \tag{IV-3b}$$

$$k_1(n) = h(n) \tag{IV-3c}$$

$$k_2(n) = -(-1)^n h(n). \tag{IV-3d}$$

These relations in terms of the z-transforms of the filters can be written as:

$$H_1(z) = H(z) \tag{IV-4a}$$

$$H_2(z) = H(-z) \tag{IV-4b}$$

$$K_1(z) = H(z) \tag{IV-4c}$$

$$K_2(z) = -H(-z). \tag{IV-4d}$$


The replacement of the parameter z by -z represents a rotation by an angle of π in the z-plane.
This is a shift of π in the Fourier transform of these signals. Thus, H(-z) is a highpass filter
with a frequency response being a shifted replica of the frequency response of H(z), a lowpass
filter. If h(n) is a real function, the magnitude of its frequency response is an even function.
A shift of π will be equivalent to a reflection of the magnitude about the point ω = π/2. Hence,
H(-z) is the mirror filter of H(z).
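The mirror relationship can be verified numerically for any real prototype. In the sketch below, the six-tap filter `h` is a hypothetical crude lowpass chosen only for illustration; the check compares the magnitude response of the sign-alternated filter at ω with that of the prototype at π - ω.

```python
import cmath
import math


def freq_response(taps, omega):
    """Evaluate the frequency response of an FIR filter at radian frequency omega."""
    return sum(c * cmath.exp(-1j * omega * n) for n, c in enumerate(taps))


h = [0.05, 0.2, 0.4, 0.4, 0.2, 0.05]                 # hypothetical lowpass prototype
h_mirror = [(-1) ** n * c for n, c in enumerate(h)]  # coefficients of H(-z)

omegas = [i * math.pi / 16.0 for i in range(17)]
max_mismatch = max(
    abs(abs(freq_response(h_mirror, w)) - abs(freq_response(h, math.pi - w)))
    for w in omegas
)
print(max_mismatch)
```

The mismatch is at the level of floating-point rounding, so the highpass magnitude response is the reflection of the lowpass magnitude response about π/2.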

The  output  of  the  system,  S(z),  shows  that  the  aliasing  components  have  cancelled,  leaving 
only  a  linear  filtered  term  as  desired: 

$$S(z) = \tfrac{1}{2} \left[ H^2(z) - H^2(-z) \right] X(z). \tag{IV-5}$$

Note  that  there  have  been  no  restrictions  up  to  this  point  on  the  prototype  filter,  h(n),  only  on 
the  relationships  of  the  other  filters  to  h(n).  Design  of  this  filter  is  discussed  in  Section  IV-C-2. 
If  h(n)  is  a  good  approximation  to  the  ideal  half-band  lowpass  filter,  the  signal  will  be  divided 
into  two  equal  bandwidth  frequency  bands  by  it  and  its  highpass  mirror  filter. 

In summary, if the relations of Eqs. (IV-3) and (IV-4) are used, the signal components due
to aliasing are cancelled, leaving only linearly filtered signal components in the output. If the
filter is chosen such that:

$$\left| \tfrac{1}{2} \left[ H^2(z) - H^2(-z) \right] \right| = 1 \tag{IV-6}$$

and  the  phase  is  linear,  then  the  system  will  be  an  identity  system  except  for  a  delay. 
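The alias cancellation of Eq. (IV-5) can be demonstrated with a short simulation. This is a sketch under stated assumptions: the prototype `h` and the input `x` are hypothetical, and down-sampling followed by up-sampling is modeled by zeroing the odd-indexed samples. The two-band output is compared with the input passed through the single linear filter with system function ½[H²(z) - H²(-z)].

```python
import math


def convolve(a, b):
    """Direct linear convolution of two sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def resample_by_two(v):
    """Down-sample by two, then up-sample by zero insertion (keeps even-indexed samples)."""
    return [s if i % 2 == 0 else 0.0 for i, s in enumerate(v)]


def qmf_two_band(x, h):
    """Two-band analysis and synthesis using the filter relations of Eq. (IV-3)."""
    h1 = list(h)
    h2 = [(-1) ** n * c for n, c in enumerate(h)]
    k1 = list(h)
    k2 = [-c for c in h2]
    low = convolve(k1, resample_by_two(convolve(h1, x)))
    high = convolve(k2, resample_by_two(convolve(h2, x)))
    return [a + b for a, b in zip(low, high)]


def alias_free_filter(h):
    """Impulse response of (1/2)[H^2(z) - H^2(-z)]."""
    h2 = [(-1) ** n * c for n, c in enumerate(h)]
    return [0.5 * (a - b) for a, b in zip(convolve(h, h), convolve(h2, h2))]


h = [0.05, 0.2, 0.4, 0.4, 0.2, 0.05]  # hypothetical prototype, not the filter of the thesis
x = [math.sin(0.3 * n) + 0.5 * math.cos(2.1 * n) for n in range(64)]
y = qmf_two_band(x, h)
y_linear = convolve(alias_free_filter(h), x)
max_difference = max(abs(a - b) for a, b in zip(y, y_linear))
```

The agreement holds for an arbitrary prototype, which reflects the remark that only the relationships among the four filters matter for alias cancellation. Whether the remaining linear filter approximates a pure delay depends on the design of h(n).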

The  quadrature  mirror-filter  system  can  be  simply  extended  to  divide  the  frequency  range 
into  more  than  two  bands.  A  system  that  implements  a  decomposition  into  four  bands  is  shown 
in Fig. IV-5. If the conditions on the filters are met such that the basic two-band system is an
identity system, then introduction of that identity system in the middle of another system will
not affect the overall system output. It is clear that in that case, the output in Fig. IV-5 will
be identical to the input.

Fig. IV-5. Four-channel quadrature mirror-filter decomposition.

If the only requirement placed on the filters in the 4-channel system is that they follow the
relationships of Eq. (IV-3), the aliasing components will vanish. The system will behave as a
linear filter with system response described as follows:

2-Channel System Function

$$G(z) = \frac{S(z)}{X(z)} = \tfrac{1}{2} \left[ H^2(z) - H^2(-z) \right]. \tag{IV-7}$$

4-Channel System Function

$$G'(z) = \tfrac{1}{4} \left[ H^2(z^2) - H^2(-z^2) \right] \left[ H^2(z) - H^2(-z) \right] = G(z^2)\, G(z). \tag{IV-8}$$


In  the  case  that  G(z)  is  an  identity,  then  G'(z)  will  also  be  an  identity  system. 

Decomposition into any number of channels that is a power of two can be performed by
further extension of the basic scheme. In the same manner, it can also be shown that the aliasing
components will vanish if the same relationships of the filters are maintained. The linear
filter-system response will be a function of the basic filter that is used. For example, the
8-frequency band case will have the system function, G''(z):

8-Channel System Function

$$G''(z) = G'(z^2)\, G(z) = G(z^4)\, G(z^2)\, G(z). \tag{IV-9}$$

Fig. IV-6. Two-filter quadrature mirror-filter decomposition with quantization.


Fig. IV-7. Four-channel quadrature mirror-filter decomposition with quantization.


Although  the  decomposition  into  bands  that  are  not  of  equal  bandwidth  does  permit  some  aliasing 
at  the  output,  this  is  minor  and  is  discussed  in  Section  IV-C-3  and  in  Appendix  I. 
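The cascade structure of Eqs. (IV-7) to (IV-9) can be illustrated in terms of impulse responses. In this sketch, replacing z by z^m corresponds to inserting m - 1 zeros between coefficients, and a product of system functions corresponds to a convolution. The response `g_delay` is a hypothetical G(z) equal to a pure three-sample delay, used only to show that an identity (with delay) at the two-band level remains an identity for the tree.

```python
def convolve(a, b):
    """Direct linear convolution of two sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def expand(taps, factor):
    """Coefficients of G(z^factor): insert factor - 1 zeros between the taps of G(z)."""
    out = [0.0] * ((len(taps) - 1) * factor + 1)
    for i, c in enumerate(taps):
        out[i * factor] = c
    return out


def tree_system_function(g, stages):
    """Coefficients of G(z) G(z^2) ... G(z^(2^(stages - 1))) for a full binary tree."""
    result = [1.0]
    for stage in range(stages):
        result = convolve(result, expand(g, 2 ** stage))
    return result


g_delay = [0.0, 0.0, 0.0, 1.0]              # hypothetical G(z): a pure delay of 3 samples
g_four = tree_system_function(g_delay, 2)   # 4-channel system, G(z^2) G(z)
g_eight = tree_system_function(g_delay, 3)  # 8-channel system, G(z^4) G(z^2) G(z)
```

With a three-sample delay per two-band stage, the delays accumulate as 3 + 6 = 9 samples for four channels and 3 + 6 + 12 = 21 samples for eight channels, while the magnitude response stays flat.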

2.  Design  of  the  Mirror  Filters 

It is shown in Appendix I that the class of filters that result in perfect reconstruction of
the signal, i.e., an identity system, is not suitable for a practical system. This Section will
explore the issues involved in the design of the prototype filter for use in the system.

There are several design constraints on the filters. With no quantization, coding, or
processing other than the system filtering and resampling, the frequency response of the system
should not introduce audible coloration to the signal. The reconstructed signal at the output
must sound identical to the input signal. When quantization and coding are performed, the
encoding error of a band should be confined to a spectral region close to that band.

The use of linear-phase FIR filters will result in a system function with no phase distortion,
only linear phase components. The magnitude response of the basic half-band lowpass
filter should be 3 dB down at π/2, the crossover with its mirror-image highpass filter. When
the signal passes through a filter twice in each channel, the response of each channel will
be 6 dB down at crossover. The two channels will then sum correctly at that frequency.
By using filters with low-ripple passband response, the system-response ripple will be
minimized.

When error is introduced to a channel through quantization of the signal, that noise is
filtered and will result in noise at the output. Figure IV-6 shows the basic 2-channel, quadrature
mirror-filter system with quantization, a nonlinear function modeled as an additive error
signal. Since resampling, filtering, and summation are linear functions, this additive noise
term will be a processed additive noise term at the output. Resampling affects the error signal
by scaling the frequency axis. If the additive error is wideband, as can normally be assumed,
this will not change its spectral extent. The error from quantization of a band in the two-band
system will result in a wideband noise that is filtered once in that channel.

The  multiband  system  is  more  complex  as  illustrated  by  the  four-band  system  in  Fig.  IV-7. 
The  error  signal  of  each  channel  is  resampled  and  filtered  twice.  The  resampling  operation 
after  filtering  scales  the  frequency  axis  in  a  manner  that  halves  the  filter  passband  bandwidth, 
sharpens the band edges, and inserts a replica of the filter shifted a distance of π from the
original.  Unfortunately,  the  band  edges  of  every  channel  error  signal  are  not  defined  by  the 
sharpened  filters.  Figure  IV-8  illustrates  the  filtering  of  the  wideband  noise  generated  by  the 
quantization  of  each  channel.  The  error-signal  transition  edges  of  channels  1  and  4  have  been 
sharpened  by  a  factor  of  two.  Channels  2  and  3,  however,  have  only  one  sharp  edge.  Hence, 
there  will  be  some  bands  where  the  transition  slope  will  only  be  as  steep  as  the  prototype  filter, 
h(n). 

As  discussed  in  Chapter  III,  there  is  little  masking  at  a  distance  of  more  than  a  critical 
band  from  the  masker.  Although  the  SNR  in  a  band  may  only  need  to  be  on  the  order  of  25  dB, 
the  noise  due  to  encoding  of  that  band  must  be  attenuated  80  dB  or  more  at  frequencies  away 
from  the  band  where  there  is  no  signal  energy.  Natural  signals  rarely  have  sharp  discontinuities 
in  their  spectrum  so  that  filter  attenuation  in  the  stop  band  next  to  the  transition  band  need  not 
be  much  more  than  40  dB.  However,  the  filter  must  have  a  stop-band  attenuation  that  continues 
to increase away from the passband.

Fig. IV-8. Spectra of error at output produced by quantization of each channel
in a four-channel quadrature mirror-filter system.

The filter chosen for use in the system is a Hanning windowed ideal bandpass filter of length
64. The width of the transition band is approximately π/8. A filter of higher order would have
resulted in sharper transition bands at an increased computation cost. Evaluation of the system
shows that this filter is adequate. The minimum stop-band attenuation is 44 dB. A Hanning
window was chosen over other possible windows because attenuation in the stop-band increases
at 18 dB per octave of frequency distance from the passband. This rapid increase ensures that
little quantization noise is present far from the frequency of its band.
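A window-method design of this kind can be sketched as follows. The sketch assumes a half-band lowpass prototype with a cutoff of exactly π/2 and a Hanning window with zero-valued end points; neither detail is stated in the text, and the function names are introduced here only for illustration.

```python
import cmath
import math


def hann_lowpass(length, cutoff):
    """Hanning-windowed ideal lowpass filter, symmetric about its centre (linear phase)."""
    centre = (length - 1) / 2.0
    taps = []
    for n in range(length):
        t = n - centre
        # Ideal lowpass impulse response, with the t = 0 limit handled explicitly.
        ideal = cutoff / math.pi if t == 0 else math.sin(cutoff * t) / (math.pi * t)
        window = 0.5 - 0.5 * math.cos(2.0 * math.pi * n / (length - 1))
        taps.append(ideal * window)
    return taps


def gain_db(taps, omega):
    """Magnitude response in dB at radian frequency omega."""
    response = sum(c * cmath.exp(-1j * omega * n) for n, c in enumerate(taps))
    return 20.0 * math.log10(max(abs(response), 1e-12))


h = hann_lowpass(64, math.pi / 2.0)  # assumed cutoff; length 64 as in the text

# Worst-case gain, relative to DC, from pi/2 + pi/8 up to pi.
stopband = [math.pi * (0.625 + 0.375 * i / 200.0) for i in range(201)]
worst_stopband_db = max(gain_db(h, w) for w in stopband) - gain_db(h, 0.0)
print(round(worst_stopband_db, 1))
```

With the cutoff placed exactly at π/2, the window method yields a response about 6 dB down at π/2, so the 3-dB crossover described above would call for a slightly different cutoff than the one assumed here.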

A method to take advantage of the sharpening of the filters of inner-band splitting would be
to use filters of different lengths for different stages of decomposition. This could result in
great  computational  savings.  In  the  eight-band  system,  for  example,  the  innermost  filters 
could  be  of  length  32,  the  middle  decomposition  filters  of  length  64,  and  the  final  filters  of 
length  128.  The  computation  necessary  would  be  equivalent  to  a  system  using  filters  of  length 
64,  exclusively.  All  of  the  bands  would  then  have  the  same  slope,  a  factor  of  two  sharper  than 
several  of  the  band-edge  slopes  in  a  system  with  all  filters  of  length  64. 


3.  Unequal  Bandwidth  Filter  Bank  Design 

The basic two-band, quadrature mirror-filter system and the extensions to four, eight,
and larger numbers of channels are designed so that the bandwidths of each channel are identical.
To match the bandwidths of each channel to the critical bandwidths desired, it is necessary to
modify the scheme to permit decomposition into bands that are not of equal bandwidths.

The  simplest  solution  is  to  divide  the  frequency  range  into  many  small  bandwidth  channels. 
The  encoding  of  those  channels  that  were  to  be  of  larger  bandwidth  would  occur  after  partial 
reconstruction  of  the  channels;  i.e.,  the  sum  of  several  of  the  small  bands.  The  analysis  of 
equal bandwidth decomposition shows that no aliasing occurs in this scheme. It is, however, very
inefficient computationally, since processing is spent dividing bands that will not be used in
their fully divided state.

Fig. IV-9. Three-channel quadrature mirror-filter decomposition.

A more efficient method would be to use a partial tree structure where some branches are
decomposed further than other branches and, therefore, result in smaller bandwidth channels.
This is incorporated in Fig. IV-9, a three-channel system. It is clear that if it were possible
to design the basic two-channel decomposition as an identity system, its presence or absence in
a branch would not affect the overall system function and any partial tree structure would be an
identity system. It is shown in Appendix I, however, that it is not practical to use the filters
necessary for the identity system. For the equal bandwidth decompositions in Section IV-C-1,
it was shown that if the special relationships of the filters were maintained, the aliasing products
would vanish. This result relies on the symmetry of the system such that the system equations
factor properly. When a partial tree structure is used, some aliasing will remain. The output
of the three-channel system is shown in Eq. (IV-10):

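$$
\begin{aligned}
S(z) = {} & \tfrac{1}{2} \left[ H^2(z)\, G(z^2) - H^2(-z) \right] X(z) \\
          & + \tfrac{1}{2} \left[ G(z^2) - 1 \right] \left[ H(z)\, H(-z) \right] X(-z).
\end{aligned} \tag{IV-10}
$$

In terms of the Fourier transform:

$$
\begin{aligned}
S(\omega) = {} & \tfrac{1}{2} \left[ H^2(\omega)\, G(2\omega) - H^2(\omega + \pi) \right] X(\omega) \\
               & + \tfrac{1}{2} \left[ G(2\omega) - 1 \right] \left[ H(\omega)\, H(\omega + \pi) \right] X(\omega + \pi).
\end{aligned} \tag{IV-11}
$$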

Of interest is the second term in Eqs. (IV-10) and (IV-11) representing the aliasing in the output.
The aliasing will only appear in the frequency overlap of the filter h(n) and its mirror filter;
i.e., frequencies where the product H(ω) H(ω + π) is nonzero. This overlap is only in the region
around π/2, the half-band splitting frequency. The aliasing is also scaled by G(2ω) - 1. G(ω)
is the linear system function of the basic two-band scheme and varies from unity only near π/2.
The frequency-scaled G(2ω) varies from unity only near π/4 and 3π/4. This scaling factor is
very small at π/2. Hence, the aliasing will still be minimal using the partial tree structure for
unequal bandwidth decomposition. To summarize the analysis in terms of design constraints,
there will be little aliasing if the two-band, linear-system-function, G(ω), has ripple only at a
frequency of π/2 and H(ω) is very nearly zero at a frequency of 3π/4. The Hanning-window-design
filters meet this requirement.

Using  the  partial  tree,  unequal-bandwidth,  channel-decomposition  technique,  the  structure 
of  Fig.  IV-1  is  implemented.  The  actual  bandwidths  of  the  24-channel,  15-kHz,  audio-encoding 
system and of the 17-channel, 4.1-kHz, speech-encoding system are given in Table IV-1. These
bandwidths  differ  from  the  critical  bandwidths  as  shown  in  Fig.  III-4  because  of  the  constraints 
imposed  by  the  decomposition  technique. 

D.  Quantization  Algorithms 

The principal requisite of the quantization scheme is that it adapt to changes in channel
signal level to maintain a constant percentage error. For a given number of bits for quantization,
the quantization levels must allow for the amplitude peaks of the input signal. The SNR
in each channel must also meet the masking requirements specified in Chapter III.

Instantaneous  companding  schemes  use  logarithmic  spacing  of  quantizer  levels  to  permit 
the  large,  dynamic,  input  range  necessary  for  speech  and  audio  signals.  The  RMS  error  in 
encoding  each  sample  is  a  constant  percentage  of  the  sample  value.  Each  sample  is  quantized 
independently. 


TABLE IV-1

CHANNEL BANDWIDTHS OF DIGITAL ENCODER

Channel     4.1-kHz Speech Encoding System      15-kHz Audio Encoding System
Number      Frequency Range    Bandwidth        Frequency Range    Bandwidth
                 (Hz)             (Hz)               (Hz)             (Hz)

   0           0 -  129           129               0 -   117          117
   1         129 -  258           129             117 -   234          117
   2         258 -  387           129             234 -   351          117
   3         387 -  516           129             351 -   468          117
   4         516 -  645           129             468 -   585          117
   5         645 -  774           129             585 -   703          117
   6         774 -  903           129             703 -   820          117
   7         903 - 1033           129             820 -   937          117
   8        1033 - 1291           258             937 -  1171          234
   9        1291 - 1549           258            1171 -  1406          234
  10        1549 - 1807           258            1406 -  1640          234
  11        1807 - 2066           258            1640 -  1875          234
  12        2066 - 2324           258            1875 -  2109          234
  13        2324 - 2582           258            2109 -  2343          234
  14        2582 - 3099           516            2343 -  2812          468
  15        3099 - 3615           516            2812 -  3281          468
  16        3615 - 4132           516            3281 -  3750          468
  17             -                 -             3750 -  4687          937
  18             -                 -             4687 -  5625          937
  19             -                 -             5625 -  6562          937
  20             -                 -             6562 -  7500          937
  21             -                 -             7500 -  9375         1875
  22             -                 -             9375 - 11250         1875
  23             -                 -            11250 - 15000         3750


Speech and natural audio signals do not vary instantaneously in level. The average power
is usually constrained by attack transients with time constants greater than 10 ms, and decay
transients with time constants greater than 100 ms. Adaptive quantization systems take advantage
of this slow variation of the short-time average power by specifying a normalization
gain factor at a much lower rate than the sampling of the signal. This yields a lower bit rate
than the instantaneous companders. Many adaptive strategies, however, tend to overload when
presented with a sharp attack transient. For a short time, this overload can produce a very
large error that would be audible. Block-companding adaptive PCM was chosen because of its
immunity to overload. Sometimes known as block-floating-point encoding, the scheme uses a
quantized block maximum magnitude to normalize all samples in the block. Each sample is
then quantized by linear PCM. (An optimized distribution of quantization levels may be used.
The decrease in RMS error will depend on how well the sample value probability distribution is
estimated and the number of quantization levels in use.) By quantizing the block maximum to a
level larger than the block maximum, overload is avoided.

Since the samples in a block are all quantized using the same normalization, the RMS error
is constant over the block. If the block length is too long, an increase in signal level near the
end of a block will cause a proportional increase in error energy throughout the block that will
not be masked by the backward masking (premasking) effect of the signal. As the temporal
extent of backward masking is much less than forward masking (postmasking), this is the limiting
case. The block lengths were chosen to be inversely proportional to the channel bandwidth
as is the time window in the peripheral auditory system used for the short-time spectral analysis
on the basilar membrane. Due to the lack of nonsimultaneous masking data as a function of
stimuli frequency, it is not clear whether the temporal integration for masking follows the same
proportionality. Block lengths of eight samples in each channel were chosen for the encoder.
These lengths in seconds for each channel are given in Table IV-2. In any case, it is likely
that the block lengths can be made longer in the high-frequency channels since virtually all
naturally produced audio signals have attack times much longer than the block lengths of these
channels.
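The block-companding scheme described in this Section can be sketched as follows. This is a minimal illustration rather than the encoder's actual implementation. It assumes scale factors that are exact powers of two (about 6 dB apart), a hypothetical smallest scale factor `SMALLEST_SCALE`, a uniform mid-rise quantizer, and the eight-sample block length given above; the count of 24 scale factors is the figure quoted in the evaluation of the speech encoder.

```python
import math

NUM_SCALE_FACTORS = 24         # number of scale factors quoted for the encoder
SMALLEST_SCALE = 2.0 ** -23    # hypothetical: makes the largest scale factor equal to 1.0


def scale_factor(index):
    """Scale factors spaced by a factor of two (assumed here to stand for 6-dB steps)."""
    return SMALLEST_SCALE * 2.0 ** index


def encode_block(block, bits):
    """Pick the smallest scale factor above the block peak, then apply linear PCM."""
    peak = max(abs(s) for s in block)
    index = 0
    while index < NUM_SCALE_FACTORS - 1 and scale_factor(index) <= peak:
        index += 1
    levels = 2 ** bits
    step = 2.0 / levels
    codes = []
    for s in block:
        code = int(math.floor((s / scale_factor(index) + 1.0) / step))
        codes.append(min(levels - 1, max(0, code)))  # guard against range overflow
    return index, codes


def decode_block(index, codes, bits):
    """Reconstruct each sample at the centre of its quantization interval."""
    step = 2.0 / 2 ** bits
    return [((c + 0.5) * step - 1.0) * scale_factor(index) for c in codes]


block = [0.01, -0.2, 0.13, 0.05, -0.07, 0.19, 0.0, -0.11]  # hypothetical 8-sample block
index, codes = encode_block(block, bits=4)
decoded = decode_block(index, codes, bits=4)
max_error = max(abs(a - b) for a, b in zip(block, decoded))
```

Because the scale factor always exceeds the block peak, no sample is clipped, and the error of every sample in the block is bounded by half a quantization step times the scale factor.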


V.  EVALUATION 


A.  Introduction 

The digital encoding system is designed so that the reconstructed output signal should sound
identical to the input signal if the SNR requirements are met in each channel. The experiments
to estimate those requirements were performed, however, using tone and narrowband noise
stimuli. It cannot be expected that the perception of the complex sounds that are present in
speech and audio signals is exactly the same as for the primitive stimuli. The perception of
speech sounds, for example, is related to the understanding of speech and the generation of
speech. It was not known whether the masking under these conditions is greater than, less
than, or the same as the masking present in the simple stimuli experiments.

The system was evaluated using speech signals with a 4.1-kHz bandwidth and music selections
of 15-kHz bandwidth. Experiments were performed to determine the level of quantization
error and the bit rate at the differential threshold of the encoding error.

For many applications, perfect speech reproduction is not necessary. For these communication
systems, it is desired to have highly intelligible, pleasant and natural sounding speech at
lower bit rates for use on less expensive, lower-capacity channels. To meet these demands, a
17-channel, 3.2-kHz version of the system was evaluated at a bit rate of 16 kbps.

B.  High-Quality  Speech  Encoding 

To test the performance of the system for speech encoding, a 17-band, 4.1-kHz configuration
was used. The goal was to determine the lowest bit rate, and the corresponding quantization
bit distribution, at which the encoding error in speech is just detectable. The error intensity
at which the encoding error is just detectable is the masked or differential threshold of
the encoding error in the presence of the signal at that signal level. Controlled experiments
were performed to determine this threshold.

The  source  material  was  24  phonetically  balanced  sentences,  each  approximately  two  seconds 
in  length.  Four  speakers,  two  male  and  two  female,  were  recorded  with  an  Electrovoice  667 
microphone  in  a  soundproof  room  in  digital  format  through  a  16-bit  A/D  converter.  For  testing, 
the  speech  was  played  back  directly  from  digital  storage  through  a  16-bit  D/A  converter  and 
Stax  SR-X  electrostatic  headphones  in  the  soundproof  room. 

Experiments consisted of two-interval, two-alternative, forced-choice (2I2AFC) trials to
compare the original, unprocessed sources to the encoded, processed versions of the same sentence.
Each interval of a trial was chosen randomly and independently of the other interval to
have either the original or processed version of a sentence. Thus, each of the four permutations
of original and processed versions of the same sentence, same speaker, was equally likely to occur.

Subjects were asked to judge the two intervals as being identical or not identical. Any audible
difference, noise, distortion, coloration, etc., was valid for a "not identical" judgment.
Subjects were told a priori that the probabilities of the intervals being the same or being different
were equal.

If a subject could not discriminate any audible differences between the original and processed
sentences in any trial, the probability of a correct response would be 50 percent, i.e., random
guessing. Just-detectable degradation would result in a probability of 75-percent correct, half way
between 50-percent chance and 100-percent perfect detection. Thus, the error present at this
score is defined as the differential sensory threshold of the encoding degradation (JND).

The test sessions consisted of 24 training trials with feedback followed by 48 trials that were
scored. Seven sets of system parameters, each representing a quantization bit distribution and
a resultant bit rate, were used. Four to six subjects were tested with each set of system parameters.
The results of the experiments are presented in Table V-1 and Fig. V-1. A statistical
analysis of the experimental procedure and explanation of the results can be found in
Appendix II.

The performance of the system as a function of the quantization bit distribution shows that
the SNR in the high-frequency channels can be less than that needed by the mid-frequency channels.
System E from Table V-1 with a bit rate of 34.4 kbps yields an average subject performance
score of 72-percent correct, just under the JND threshold. Thus, this system can reproduce
speech without audible degradation. It is difficult to compare this performance with other
digital encoders because few systems have been evaluated at this high performance level. It can
be speculated that the introduction of a highly optimized scheme, similar to that of adaptive
transform coding (ATC), could further reduce the bit rate and yield comparable results.
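The bit rates listed in Table V-1 can be approximately reproduced from the channel bandwidths and bit distributions. The sketch below is an estimate under assumptions that are not stated in the text: each channel is sampled at twice its tabulated bandwidth, and the side information is one 5-bit scale factor per eight-sample block (five bits being the minimum needed to select one of 24 scale factors). Any framing or synchronization bits are ignored.

```python
BLOCK_LENGTH = 8          # samples per block, as given for the encoder
SCALE_FACTOR_BITS = 5     # assumption: minimum word length to index 24 scale factors

# Channel bandwidths (Hz) of the 17-channel speech system, from Table IV-1.
BANDWIDTHS_HZ = [129] * 8 + [258] * 6 + [516] * 3

# Quantization bits per channel for Systems A to G, from Table V-1.
BIT_DISTRIBUTIONS = {
    "A": [4] * 17,
    "B": [3] + [4] * 15 + [3],
    "C": [4] * 15 + [3, 3],
    "D": [3] + [4] * 13 + [3] * 3,
    "E": [3] + [4] * 12 + [3] * 4,
    "F": [3, 3] + [4] * 12 + [3] * 3,
    "G": [3, 3] + [4] * 11 + [3] * 4,
}

# Bit rates (kbps) reported in Table V-1.
REPORTED_KBPS = {"A": 38.3, "B": 36.9, "C": 36.2, "D": 34.9,
                 "E": 34.4, "F": 34.6, "G": 34.1}


def estimated_kbps(bits_per_channel):
    """Sample bits plus scale-factor side information, summed over all channels."""
    total = 0.0
    for bandwidth, bits in zip(BANDWIDTHS_HZ, bits_per_channel):
        sample_rate = 2.0 * bandwidth  # assumed Nyquist-rate sampling of each channel
        total += sample_rate * (bits + SCALE_FACTOR_BITS / BLOCK_LENGTH)
    return total / 1000.0


estimates = {name: estimated_kbps(bits) for name, bits in BIT_DISTRIBUTIONS.items()}
for name in sorted(estimates):
    print(name, round(estimates[name], 2), REPORTED_KBPS[name])
```

Under these assumptions the estimates fall within roughly 0.1 kbps of the tabulated rates; the residual difference may reflect rounding of the tabulated bandwidths or overhead not modeled here.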


TABLE V-1

QUANTIZATION BIT DISTRIBUTION AND EXPERIMENTAL RESULTS OF SPEECH ENCODING SYSTEM

Channel     Frequency             Quantization Bit Distribution
Number      Range (Hz)        A      B      C      D      E      F      G

   0           0 -  129       4      3      4      3      3      3      3
   1         129 -  258       4      4      4      4      4      3      3
   2         258 -  387       4      4      4      4      4      4      4
   3         387 -  516       4      4      4      4      4      4      4
   4         516 -  645       4      4      4      4      4      4      4
   5         645 -  774       4      4      4      4      4      4      4
   6         774 -  903       4      4      4      4      4      4      4
   7         903 - 1033       4      4      4      4      4      4      4
   8        1033 - 1291       4      4      4      4      4      4      4
   9        1291 - 1549       4      4      4      4      4      4      4
  10        1549 - 1807       4      4      4      4      4      4      4
  11        1807 - 2066       4      4      4      4      4      4      4
  12        2066 - 2324       4      4      4      4      4      4      4
  13        2324 - 2582       4      4      4      4      3      4      3
  14        2582 - 3099       4      4      4      3      3      3      3
  15        3099 - 3615       4      4      3      3      3      3      3
  16        3615 - 4132       4      3      3      3      3      3      3

Data Rate (kbps):           38.3   36.9   36.2   34.9   34.4   34.6   34.1
Percent Correct:             63     67     68     64     72     85     94

Fig. V-1. Relation of the bit rate to the experimental results.
[Vertical axis: percent correct (%).]


These results can be compared to the results of the masking experiment presented in
Chapter III. Those results, shown in Table III-1 and Fig. III-6, indicated that the masked threshold
of a narrow band of noise masked by a sinusoid at 70-dB SPL centered in the noise is between
42- and 52-dB SPL. This is an SNR at the masked threshold of 18 to 28 dB, depending on the
frequency of the stimuli. The SNR present in each channel of the encoding system for System E
can be estimated from the quantization algorithm. The encoding degradation for the quantization
bit distribution of System E has been found to be below the masked threshold.

Each channel is quantized with a block-companding APCM scheme. The adaptation is performed
by using a scale factor to normalize a block of channel samples. Each block uses one of 24 scale
factors, spaced 6 dB apart. The scale factor is chosen so that it is always greater than the
largest magnitude of any sample in the block. Thus, it averages 3 dB greater than the largest
sample. After normalization, the samples in the block are quantized by linear PCM. Assuming
that the companding can adapt quickly enough such that the quantization range up to the largest
sample is used throughout the block, the SNR is approximately 6N - 1 dB for N-bit quantization.
For 4-bit-per-sample quantization, the SNR is 23 dB in a channel. For 3-bit-per-sample
quantization, it is 17 dB. Hence, the amount of masking of quantization error by speech is slightly
greater than the amount of masking of narrowband noise by pure tones.
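One way to arrive at a figure close to 6N - 1 dB is sketched below. It is an illustration under stated assumptions, not the derivation used in the report: the channel signal is taken to be a sinusoid of unit peak, the quantizer range extends 3 dB above that peak (the average scale-factor headroom described above), and the quantization error is modeled as uniform with variance step²/12. The function name and parameters are hypothetical.

```python
import math


def apcm_snr_db(n_bits, headroom_db=3.0):
    """Approximate SNR (dB) of an N-bit uniform quantizer on a sinusoid.

    Assumptions (not from the report): unit-peak sinusoid, quantizer
    range of +/- scale with scale headroom_db above the peak, and
    quantization error uniform over one step (variance step**2 / 12).
    """
    scale = 10.0 ** (headroom_db / 20.0)
    step = 2.0 * scale / (2 ** n_bits)
    noise_power = step ** 2 / 12.0
    signal_power = 0.5  # mean-square value of a unit-peak sinusoid
    return 10.0 * math.log10(signal_power / noise_power)


for n in (2, 3, 4):
    print(n, "bits:", round(apcm_snr_db(n), 1), "dB; 6N - 1 =", 6 * n - 1, "dB")
```

Under this model the result is about 6.02N - 1.24 dB, which rounds to the 23-dB and 17-dB values quoted for 4 and 3 bits, and to 11 dB for 2 bits.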

C.  High-Quality  Audio  Encoding 

Evaluation  of  the  encoding  scheme  was  performed  with  a  24-channel,  15-kHz  bandwidth  sys¬ 
tem.  Difficulty  in  obtaining  the  necessary  quality  source  material  and  computer  hardware  con¬ 
straints  limited  the  experiments  performed. 

The  source  material  was  musical  segments  of  25  to  40  seconds  in  length  taken  from  the 
selections  shown  in  Table  V-2.  The  music  had  been  recorded  on  record  disk  in  compressed  form 
by  the  use  of  an  analog  compander.  This  permits  a  much  larger  dynamic  range  than  would  have 
been  possible  without  companding.  While  the  quality  of  the  source  material  was  high,  it  would 
have  been  preferable  to  have  used  music  recorded  directly  in  digital  format. 

Music  segments  were  processed  with  the  quantization  bit  distribution  shown  in  Table  V-3, 
a  bit  rate  of  123.75  kbps.  Several  experienced  listeners  compared  the  original  segments  with 
the  processed.  Consensus  was  that  the  processed  music  was  audibly  identical  to  the  original. 


TABLE  V-2 

MUSIC  SELECTIONS  FOR  AUDIO  SYSTEM  EVALUATION 


1. Masters of Flute and Harp, Vol. 1,
   Klavier Records KS-556

2. Stan Kenton Plays Chicago,
   Creative World Records ST-1072

3. Rags and Other American Things,
   Eastern Brass Quintet,
   Klavier Records KS-539

4. The Heralds of Love,
   Klavier Records KS-559

5. Bach: Praeludium from Partita #1 in B flat,
   Klavier Records KS-524

6. St. Saens: Organ Symphony 3 in C minor, Opus 78,
   The City of Birmingham Orchestra,
   Klavier Records KS-526
D. Toll-Quality Speech Encoding

For  many  communication  systems,  it  is  not  necessary  to  have  perfect  reproduction  of  the 
speech  signals.  A  lower  quality  is  acceptable  as  long  as  it  is  highly  intelligible,  pleasant  and 
natural  sounding.  Quantifying  the  acceptable  level  is  difficult  since  it  is  actual-system-user 
acceptance  that  determines  the  necessary  fidelity.  In  the  literature,  this  quality  is  referred  to 
as  toll  or  telephone  quality. 

The encoding system developed here was designed using the results of psychoacoustic
experiments to achieve an inaudible encoding error. It is conjectured that when the level of
encoding error is too large for inaudibility, the system will degrade the speech in a pleasant
manner. The spectral and temporal shaping of this degradation would be such that a larger noise
level and, therefore, a lower bit rate, would be acceptable to a listener than for wideband
waveform coding schemes such as ADPCM. The achievement of toll-quality speech transmission at
16 kbps with an ATC system supports this speculation (Ref. 17).

For evaluation at 16 kbps, a 17-channel, 3.2-kHz system was implemented. Speech processed
by this encoder, however, has a rough quality. To correct this particularly annoying
form of degradation, it is necessary to analyze the assumptions made in the design of the
encoder. The block-companding, adaptive-PCM coding strategy was based on nonsimultaneous
masking results. At a bit rate of 16 kbps, the average quantization is two bits per sample, not
including the quantization of the block magnitude. The SNR, as developed in Section V-B, is
approximately 11 dB. Clearly, the quantization error will not be masked. The block-coding
method causes the error to be of constant level for each entire block. In speech segments where
there are large amplitude transitions, this noise will tend to extend the speech abruptly to the
block edges. Just before an attack transient, the noise energy can exceed the speech energy
all the way back to the beginning of the companding block. This deformation of the speech can be
seen on the speech spectrograms in Fig. V-2a and b.

To eliminate this source of distortion, the companding scheme was modified as shown in
Fig. V-3. The main purpose of this modification is to make the adapting magnitude a
smooth function. The magnitude of the samples is filtered and then sampled at a 50-Hz rate.
The adaptation time of the quantizer is 20 ms. Since the quantization error is also scaled by the
magnitude function, it is now a smoothly varying noise signal in each channel. The system is
no longer immune to overload on sharp transients with rise times less than 20 ms. The problem
of overload is reduced by scaling the quantized magnitude function to permit peak samples
several dB larger. This, of course, increases the quantization error level by the same amount.
A compromise value was found to optimize the perceptual quality of the output. The magnitude
is quantized to the nearest larger level of a set of levels spaced in 3-dB increments. It is then
scaled up by an additional 3 dB. The result can be seen in the spectrogram (Fig. V-2c).
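The magnitude quantization rule just described (round up to the next level on a 3-dB grid, then add 3 dB of overload margin) can be sketched as follows. This is a minimal illustration: the function name, the choice of 1.0 as the 0-dB reference for the grid, and the parameter names are assumptions, and the smoothing filter that precedes this step is not specified in the text and is not modeled.

```python
import math


def quantize_magnitude(mag, step_db=3.0, margin_db=3.0):
    """Return the scale factor for a smoothed magnitude sample.

    The magnitude is rounded UP to the nearest level on a grid of
    step_db increments (grid referenced to 1.0 = 0 dB, an assumption),
    then raised by margin_db to leave room for transient peaks.
    """
    if mag <= 0.0:
        raise ValueError("magnitude must be positive")
    level_db = 20.0 * math.log10(mag)
    quantized_db = step_db * math.ceil(level_db / step_db)
    return 10.0 ** ((quantized_db + margin_db) / 20.0)


for m in (0.01, 0.3, 0.5, 1.0, 7.3):
    s = quantize_magnitude(m)
    print(m, "->", round(s, 4), "(", round(20.0 * math.log10(s / m), 2), "dB above )")
```

With these settings the scale factor always lies between 3 and 6 dB above the smoothed magnitude, which is the trade-off noted above: more protection against overload at the cost of a correspondingly higher quantization error level.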

At a bit rate of 15.8 kbps, several trained listeners compared the system to a 16-kbps
adaptive-predictive-coding (APC) system. High-quality speech as well as sentences with varied
levels of background noise were used as input signals. The quality of the output speech of the
two systems was similar, and both were of toll quality. The robustness of the system, that is,
its ability to reproduce speech that is embedded in background noise, also compared well with
that of the APC system.
VI. SUMMARY AND TOPICS FOR FURTHER RESEARCH


A.  Summary 

In this report, a digital encoder for speech and audio signals has been described. The
technique of transformation of the signal to a domain where quantization is better matched to
audition was developed based on the results of psychoacoustic experiments. By exploiting the
limited detection ability of the auditory system as determined by masking experiments, the
system achieves performance that is comparable to or better than other encoders at the same
encoding bit rate. The required spectral and temporal shaping of the error is accomplished by
use of a multichannel system with adaptive quantization in each channel. The channel bandwidths
and quantization adaptation properties are related to the masking results.

Efficient decomposition of the input signal into critical bandwidth channels was performed
by a quadrature mirror-filter scheme. The method was developed from a basic two-channel
decomposition described in the literature. An extensive discussion of the issues of design and
implementation of the quadrature mirror filters, including practical design of the filters and
modification for a system with unequal bandwidth channels, was presented.

Evaluation  shows  that  the  system  achieves  encoding  that  is  audibly  identical  to  the  original 
signal  at  bit  rates  lower  than  were  accomplished  previously. 

B.  Topics  for  Further  Research 

The  design  of  digital  encoders  for  speech  and  audio  signals  that  account  for  the  processing 
and  limitations  of  the  auditory  system  is  a  new  field.  The  system  developed  here  demonstrates 
the  feasibility  of  such  systems  and  provides  a  framework  for  further  research.  Although  the 
signal  processing  necessary  for  the  implementation  of  this  system  requires  computation  that  for¬ 
bids  a  real-time  implementation  with  the  facilities  available,  the  technology  is  being  developed 
to  allow  it.  The  quadrature  mirror-filtering  technique  is  well  suited  to  charge-coupled  device 
(CCD),  sampled-data  filtering.  Within  a  few  years,  it  should  be  possible  to  implement  multi¬ 
channel  encoding  systems  inexpensively  as  real-time  systems. 

There  are  several  areas  for  further  development  of  this  digital  encoder.  The  psychoacous¬ 
tic  results  used  to  determine  the  perceptual  requirements  for  the  encoding  were  taken  primarily 
from  the  literature.  The  experiments  used  simple  stimuli  such  as  tones,  clicks,  and  white 
noise.  It  is  difficult  to  relate  these  results  to  the  perception  of  complex  signals  such  as  speech 
and  audio.  A  better  understanding  of  the  perceptual  requirements  for  speech  and  audio  signal 
processing  systems  is  necessary  for  further  system  development. 

The SNR required in each channel of the encoding system for the encoding degradation to be
at the threshold of detectability is slightly less than the SNR necessary for critical bandwidth
noise to be masked by a sinusoid centered in the noise. Although the critical band concept
appears in many psychoacoustic phenomena and in the physiology of the auditory system, it is not
certain that it applies directly to this system. If the masking by a sinusoid is concentrated in
a bandwidth smaller than a critical bandwidth, then a system with smaller bandwidth channels
may perform better than the critical bandwidth channel system. On the other hand, if the spread
of the masking is to a bandwidth greater than a critical bandwidth, a system with fewer channels,
each of a larger bandwidth, may perform as well as the critical bandwidth channel system.
Fewer channels would require less computation and, therefore, less costly implementation. The
results of an experiment comparing the amount of masking of various bandwidths of narrowband
noise by sinusoids would permit further optimization of the channel bandwidths of the encoding
system.

The results of the encoding system at an encoding rate of 16 kbps indicate that encoding
degradation may be more audible during certain speech sounds than others. In particular, listeners
appear to be very sensitive to segments containing onset transients. To confirm this, an
examination of the amount of masking of various noise signals by different speech sounds is
necessary. The results could be used to design a system with an adaptive quantization bit
distribution that could allocate quantization bits temporally and spectrally and, therefore,
control the SNR in each channel at all times, according to the particular requirements of each
speech sound.

Another area is development of the signal processing techniques. Introduction in multichannel
encoders of algorithms to implement processing such as variable-length coding, adaptive
bit distribution, signal prediction, and use of masking between channels could result in
further bit rate reduction. The development of these techniques depends on the understanding of
the perceptual requirements for encoding systems.
APPENDIX I


ANALYSIS  OF  QUADRATURE  MIRROR  FILTERING 


A.  BASIC  TECHNIQUE 

The basic quadrature mirror-filter technique is designed to divide the digital frequency
spectrum into two equal parts. Each band is sampled at half the rate of the original signal for
quantization, coding, and transmission. The signals are then resampled back up to the original
rate and filtered again before summing. Since the filters are not ideal filters, there is some
aliasing after the down-sampling. Because of the special relationship of the filters, when the
two bands are summed, the components due to the aliasing of one band cancel the components
due to aliasing of the other band. Thus, there is no aliasing in the resultant signal.

The structure of the basic filter block was shown in Fig. IV-4. After filtering and down-sampling,
the z-transforms of the signals are related by Eqs. (I-1) and (I-2):

$$X_1(z) = H_1(z)\,X(z) \tag{I-1a}$$

$$X_2(z) = H_2(z)\,X(z) \tag{I-1b}$$

$$Y_1(z) = \tfrac{1}{2}\,[X_1(z^{1/2}) + X_1(-z^{1/2})]
        = \tfrac{1}{2}\,[H_1(z^{1/2})\,X(z^{1/2}) + H_1(-z^{1/2})\,X(-z^{1/2})] \tag{I-2a}$$

$$Y_2(z) = \tfrac{1}{2}\,[X_2(z^{1/2}) + X_2(-z^{1/2})]
        = \tfrac{1}{2}\,[H_2(z^{1/2})\,X(z^{1/2}) + H_2(-z^{1/2})\,X(-z^{1/2})] \tag{I-2b}$$

The second terms of Eqs. (I-2a) and (I-2b) represent the aliasing due to under-sampling. The
resampling process again scales the frequency axis:

$$U_1(z) = Y_1(z^2) = \tfrac{1}{2}\,[H_1(z)\,X(z) + H_1(-z)\,X(-z)] \tag{I-3a}$$

$$U_2(z) = Y_2(z^2) = \tfrac{1}{2}\,[H_2(z)\,X(z) + H_2(-z)\,X(-z)] \tag{I-3b}$$

Filtering again and summing:

$$T_1(z) = K_1(z)\,U_1(z) = \tfrac{1}{2}\,[K_1(z)\,H_1(z)\,X(z) + K_1(z)\,H_1(-z)\,X(-z)] \tag{I-4a}$$

$$T_2(z) = K_2(z)\,U_2(z) = \tfrac{1}{2}\,[K_2(z)\,H_2(z)\,X(z) + K_2(z)\,H_2(-z)\,X(-z)] \tag{I-4b}$$

$$S(z) = T_1(z) + T_2(z)
      = \tfrac{1}{2}\,[H_1(z)\,K_1(z) + H_2(z)\,K_2(z)]\,X(z)
      + \tfrac{1}{2}\,[H_1(-z)\,K_1(z) + H_2(-z)\,K_2(z)]\,X(-z) \tag{I-5}$$

The first term of Eq. (I-5) represents the linear filtered components of the signal. If the
down-sampling were removed from the process, this term would remain (scaled by a factor of 2) while
the second term, due to aliasing, would vanish. If the filters were ideal nonoverlapping lowpass
and highpass filters, the aliasing term would disappear then also.

When realizable filters are used, the aliasing terms can still be made to cancel. One simple
solution is to let the filters be related as follows:

$$h_1(n) = h(n) \tag{I-6a}$$

$$h_2(n) = (-1)^n\,h(n) \tag{I-6b}$$

$$k_1(n) = h(n) \tag{I-6c}$$

$$k_2(n) = -(-1)^n\,h(n) \tag{I-6d}$$

The relations of the z-transforms of the filters may now be substituted in Eq. (I-5):

$$H_1(z) = H(z) \tag{I-7a}$$

$$H_2(z) = H(-z) \tag{I-7b}$$

$$K_1(z) = H(z) \tag{I-7c}$$

$$K_2(z) = -H(-z) \tag{I-7d}$$

$$S(z) = \tfrac{1}{2}\,[H(z)\,H(z) - H(-z)\,H(-z)]\,X(z)
      + \tfrac{1}{2}\,[H(-z)\,H(z) - H(-z)\,H(z)]\,X(-z)
      = \tfrac{1}{2}\,[H^2(z) - H^2(-z)]\,X(z) \tag{I-8}$$


The term that is left in Eq. (I-8) represents the reconstructed signal after being processed
by the linear filters present in the system. Note that there have been no restrictions up to this
point on the basic filter, h(n), only on the relationships of the other filters to h(n). Constraints
on the filter so that the reconstructed signal will be identical to the original will be discussed in
Section I-B. If the filter is a good approximation to the ideal half-band lowpass filter, the band
will be divided into two equal bandwidth frequency bands.

Summarizing, if the relations of Eqs. (I-6) and (I-7) are used, the signal components due to
aliasing are cancelled, leaving only linearly filtered signal components in the output. If the
filter is chosen such that:

$$\left|\tfrac{1}{2}\,[H^2(z) - H^2(-z)]\right| = 1 \tag{I-9}$$

and the phase is linear, then the system will be an identity system except for a linear phase
component; i.e., a delay.
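The alias cancellation of Eq. (I-8) can be checked numerically. The sketch below is an illustration only: the prototype filter coefficients and the test signal are arbitrary choices, and the helper names are hypothetical. It builds the four filters from the relations of Eq. (I-6), applies down-sampling followed by up-sampling in each band (which keeps the even-numbered samples and zeros the odd ones), and compares the summed output with the input filtered by the system function ½[H²(z) - H²(-z)].

```python
def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def modulate(h):
    """Return (-1)^n h(n), whose z-transform is H(-z)."""
    return [v if n % 2 == 0 else -v for n, v in enumerate(h)]


def decimate_then_zero_stuff(x):
    """Down-sample by 2, then up-sample by 2: even samples kept, odd zeroed."""
    return [v if n % 2 == 0 else 0.0 for n, v in enumerate(x)]


def qmf_two_band(x, h):
    """Two-band analysis/synthesis with the filter relations of Eq. (I-6)."""
    h1, h2 = h, modulate(h)
    k1, k2 = h, [-v for v in modulate(h)]
    t1 = convolve(decimate_then_zero_stuff(convolve(x, h1)), k1)
    t2 = convolve(decimate_then_zero_stuff(convolve(x, h2)), k2)
    return [a + b for a, b in zip(t1, t2)]


def system_response(h):
    """Impulse response of (1/2)[H^2(z) - H^2(-z)]."""
    hm = modulate(h)
    return [0.5 * (a - b) for a, b in zip(convolve(h, h), convolve(hm, hm))]


h = [0.1, 0.4, 0.4, 0.1]                        # arbitrary prototype (assumption)
x = [1.0, -2.0, 3.0, 0.5, -1.5, 2.5, 0.0, 1.0]  # arbitrary test signal
s = qmf_two_band(x, h)
expected = convolve(x, system_response(h))
max_err = max(abs(a - b) for a, b in zip(s, expected))
print("largest deviation from linear filtering:", max_err)
```

The prototype here is not a good half-band filter, so the output is not a delayed copy of the input; the point is only that no aliased component survives, whatever h(n) is chosen.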

The quadrature mirror-filter system can be simply extended to divide the frequency range
into more than two bands. A system that implements a decomposition into four bands was shown
in Fig. IV-5. If the conditions on the filters are met such that the basic two-band system is an
identity system, then application of that processing in the middle of another system will not
affect the overall system output. It is clear that in that case the output in Fig. IV-5 will be
identical to the input.

Section I-B will show that it is not practical to use filters designed such that the system is
an identity system. It is important, therefore, to analyze the decomposition into four bands for
filters that are related only by Eqs. (I-6) and (I-7), the relations that guarantee that the aliasing
components in the basic two-channel decomposition cancel; i.e., that it acts like a linear filter.

The outputs of the inner decomposition are related to their inputs by the system function,
G(z), of the two-band decomposition:

$$G(z) = \tfrac{1}{2}\,[H^2(z) - H^2(-z)] \tag{I-10}$$

$$Y'_1(z) = G(z)\,Y_1(z) \tag{I-11a}$$

$$Y'_2(z) = G(z)\,Y_2(z) \tag{I-11b}$$

The analysis continues by taking the equations derived earlier in this Section and modifying them
to account for the additional processing. Equations (I-3) now become:

$$U'_1(z) = Y'_1(z^2) = G(z^2)\,Y_1(z^2)
         = G(z^2)\,\tfrac{1}{2}\,[H_1(z)\,X(z) + H_1(-z)\,X(-z)] \tag{I-12a}$$

$$U'_2(z) = Y'_2(z^2) = G(z^2)\,Y_2(z^2)
         = G(z^2)\,\tfrac{1}{2}\,[H_2(z)\,X(z) + H_2(-z)\,X(-z)] \tag{I-12b}$$

Filtering and summing:

$$S'(z) = G(z^2)\,\tfrac{1}{2}\,[H_1(z)\,K_1(z) + H_2(z)\,K_2(z)]\,X(z)
       + G(z^2)\,\tfrac{1}{2}\,[H_1(-z)\,K_1(z) + H_2(-z)\,K_2(z)]\,X(-z)$$

$$\phantom{S'(z)} = G(z^2)\,\tfrac{1}{2}\,[H^2(z) - H^2(-z)]\,X(z)
       = G(z^2)\,G(z)\,X(z) \tag{I-13}$$
Again, the aliasing components have vanished leaving a linear filtered system. The new system
function is:

$$G'(z) = \frac{S'(z)}{X(z)} = G(z^2)\,G(z) \tag{I-14}$$

This equation also validates the earlier conjecture that if G(z) is an identity system, then G'(z)
will be also.
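Equation (I-14) can also be checked numerically by nesting one two-band system inside each branch of another. The following is a sketch under assumptions: the prototype coefficients and test signal are arbitrary, the helper names are hypothetical, and the same prototype h(n) is used at both levels. Trailing zeros introduced by the up-sampling are ignored in the comparison.

```python
def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def modulate(h):
    """Return (-1)^n h(n), whose z-transform is H(-z)."""
    return [v if n % 2 == 0 else -v for n, v in enumerate(h)]


def upsample(y):
    """Insert a zero after every sample: Y(z) -> Y(z^2)."""
    out = []
    for v in y:
        out.extend([v, 0.0])
    return out


def analysis(x, h):
    """Filter with h and its mirror filter, then keep every second sample."""
    return convolve(x, h)[::2], convolve(x, modulate(h))[::2]


def synthesis(low, high, h):
    """Up-sample, filter with k1 = h and k2 = -(-1)^n h, and sum."""
    k2 = [-v for v in modulate(h)]
    t1 = convolve(upsample(low), h)
    t2 = convolve(upsample(high), k2)
    return [a + b for a, b in zip(t1, t2)]


def two_band(x, h):
    low, high = analysis(x, h)
    return synthesis(low, high, h)


def four_band(x, h):
    """Two-band split with a further two-band system inside each branch."""
    low, high = analysis(x, h)
    return synthesis(two_band(low, h), two_band(high, h), h)


def system_response(h):
    """Impulse response of G(z) = (1/2)[H^2(z) - H^2(-z)]."""
    hm = modulate(h)
    return [0.5 * (a - b) for a, b in zip(convolve(h, h), convolve(hm, hm))]


def max_abs_diff(a, b):
    """Largest difference after zero-padding the shorter sequence."""
    n = max(len(a), len(b))
    a = list(a) + [0.0] * (n - len(a))
    b = list(b) + [0.0] * (n - len(b))
    return max(abs(p - q) for p, q in zip(a, b))


h = [0.1, 0.4, 0.4, 0.1]                             # arbitrary prototype (assumption)
x = [1.0, -2.0, 3.0, 0.5, -1.5, 2.5, 0.0, 1.0, 4.0]  # arbitrary test signal
g = system_response(h)
g_four = convolve(upsample(g), g)                    # G(z^2) G(z)
error = max_abs_diff(four_band(x, h), convolve(x, g_four))
print("deviation from G(z^2) G(z) X(z):", error)
```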

When error is introduced to a channel through quantization of the signal, that noise is
filtered and will result in noise at the output. The basic two-channel quadrature mirror-filter
system with quantization, a nonlinear function modeled as an additive error signal, was shown in
Fig. IV-6. Since resampling, filtering, and summation are linear functions, this additive noise
term will be an additive processed noise term at the output as shown in Eq. (I-15):

$$S(z) = G(z)\,X(z) + K_1(z)\,E_1(z^2) + K_2(z)\,E_2(z^2) \tag{I-15}$$

Resampling affects the error signal by scaling the frequency axis. If the additive error is
wideband, as can normally be assumed, this will not change its spectral extent. The error from
quantization of a band in the two-band system will result in a wideband noise that is filtered once
in that channel.

The multiband system is more complex, as was illustrated by the four-band system in
Fig. IV-7 and analyzed below:

$$S(z) = G(z^2)\,G(z)\,X(z)
      + [K_1(z^2)\,E_1(z^4) + K_2(z^2)\,E_2(z^4)]\,K_1(z)
      + [K_1(z^2)\,E_4(z^4) + K_2(z^2)\,E_3(z^4)]\,K_2(z)$$

$$\phantom{S(z)} = G(z^2)\,G(z)\,X(z)
      + K_1(z)\,K_1(z^2)\,E_1(z^4) + K_1(z)\,K_2(z^2)\,E_2(z^4)
      + K_2(z)\,K_2(z^2)\,E_3(z^4) + K_2(z)\,K_1(z^2)\,E_4(z^4) \tag{I-16}$$

Resampling of the filtered signal of each channel will change the effective filter frequency
response by warping of the frequency axis. The spectra of these error signals were shown in
Fig. IV-8. It was noted then that this warping will sharpen the filtering on some, but not all, of
the frequency channels.

B.  DESIGN  OF  THE  QUADRATURE  MIRROR  FILTERS 
FOR  PERFECT  RECONSTRUCTION 

In Section A, it was shown that the frequency band could be subdivided into two
equal-bandwidth, mirror-image bands using realizable, overlapping filters. Each of these bands can
be resampled at half the original sampling rate and still allow later recovery of the original
signal if the filters obey certain simple relationships. The aliasing present after down-sampling
is cancelled when the bands are summed, provided the four filters in the system are designed from
one arbitrary prototype filter, h(n). Restrictions on this filter are placed solely so that the
linear filtering and summation operations result in an acceptable system. This Section is
concerned with the design of the filter.
The output of the system was described by Eq. (I-8). It is desired that the system be an
identity system (except for a linear phase term necessitated by the use of causal, realizable
filters). Since the nonlinear aliasing terms have vanished, an impulse response of the system
and its transform, the system function, can be defined:

$$G(z) = \frac{S(z)}{X(z)} = \tfrac{1}{2}\,[H^2(z) - H^2(-z)] \tag{I-17}$$

$$G(z) = z^{-(N-1)} \quad \text{desired.} \tag{I-18}$$

Evaluation of the system function on the unit circle in the z-plane is the Fourier transform of
the impulse response:

$$G(\omega) = \tfrac{1}{2}\,[H^2(\omega) - H^2(\omega + \pi)] \tag{I-19}$$

$$G(\omega) = \exp[-j\omega(N-1)] \quad \text{is desired.} \tag{I-20}$$

The desired impulse response is just a delayed impulse. In order to design h(n), it is necessary
to see what constraints this imposes.

Let the filter H(ω) be a real, causal, symmetric FIR filter of length N. The linear phase
term may be factored out of the Fourier transform, leaving a real and even filter function,
denoted as H'(ω). This is then substituted in Eq. (I-17):

$$H(\omega) = H'(\omega)\,\exp[-j\omega(N-1)/2] \tag{I-21}$$

$$G(\omega) = \tfrac{1}{2}\,\{H'^2(\omega)\,\exp[-j\omega(N-1)] - H'^2(\omega+\pi)\,\exp[-j(\omega+\pi)(N-1)]\}$$

$$\phantom{G(\omega)} = \tfrac{1}{2}\,\exp[-j\omega(N-1)]\,[H'^2(\omega) - (-1)^{N-1}\,H'^2(\omega+\pi)]$$

$$\phantom{G(\omega)} = \tfrac{1}{2}\,\exp[-j\omega(N-1)]\,[H'^2(\omega) + (-1)^{N}\,H'^2(\omega+\pi)] \tag{I-22}$$

Since it was assumed that H(ω) is a real, symmetric FIR filter, H'(ω) is a real and even filter.
Thus,

$$H'^2(\omega) = H'^2(\omega + \pi) \quad \text{evaluated at } \omega = \frac{\pi}{2}. \tag{I-23}$$

Unless the length of the filter, N, is even, the system response, G(ω), must have a zero at that
frequency. (Modification of the system to permit use of odd filter lengths will be discussed later
in this Section.)

G(z) may now be found in terms of the transform of h(n) using Eq. (I-17) and letting the length
of the filter be even:

$$H(z) = h_0 + h_1 z^{-1} + \cdots + h_{N-1}\,z^{-(N-1)}$$

where

$$h_n = h_{N-1-n} \quad \text{and} \quad h_0 = h_{N-1} \neq 0. \tag{I-24}$$

$$H^2(z) = a_0 + a_1 z^{-1} + \cdots + a_{2N-2}\,z^{-(2N-2)} \tag{I-25a}$$

$$H^2(-z) = a_0 - a_1 z^{-1} + \cdots + a_{2N-2}\,z^{-(2N-2)} \tag{I-25b}$$

$$G(z) = a_1 z^{-1} + a_3 z^{-3} + \cdots + a_{2N-3}\,z^{-(2N-3)} \tag{I-25c}$$

where

$$a_n = h_0 h_n + h_1 h_{n-1} + \cdots + h_n h_0. \tag{I-25d}$$


In the summation to form G(z), all of the even-subscripted terms have dropped out. All of the
odd-subscripted terms other than n = (N - 1) must be set to zero and the equations solved using
Eq. (I-25d):

$$a_1 = 2\,h_0 h_1 = 0, \quad h_0 \neq 0, \quad \text{then } h_1 = 0 \tag{I-26a}$$

$$a_3 = 2\,[h_0 h_3 + h_1 h_2] = 0, \quad h_0 \neq 0 \text{ and } h_1 = 0, \quad \text{then } h_3 = 0 \tag{I-26b}$$

By induction for all n odd, n ≠ N - 1:

$$a_n = 2\,[h_0 h_n + h_1 h_{n-1} + \cdots + h_{(n-1)/2}\,h_{(n+1)/2}] = 0, \quad
h_0 \neq 0, \quad h_1 = h_3 = \cdots = h_{n-2} = 0, \quad \text{then } h_n = 0 \tag{I-26c}$$

For n = N - 1:

$$a_{N-1} = 2\,[h_0 h_n + h_1 h_{n-1} + \cdots + h_{(n-1)/2}\,h_{(n+1)/2}] = 2\,h_0 h_{N-1} = 1,
\quad \text{thus } h_0 = h_{N-1} = 2^{-1/2} \tag{I-26d}$$

$$H(z) = 2^{-1/2} + 2^{-1/2}\,z^{-(N-1)} \tag{I-27}$$

All of the odd-subscripted terms of h(n) must be zero except for n = (N - 1). By the symmetric
nature of the filter, all of the even-subscripted terms except n = 0 must also equal zero. Hence,
the only filters that can result in the desired response are identically zero except at each of the
end points. This represents a class of filters with zeros at each of the N - 1 roots of -1.
Unfortunately, this class of filters does not include any suitable half-band lowpass filters for the
system.
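The conclusion of Eq. (I-27) can be illustrated numerically. The sketch below (helper names and example coefficients are assumptions chosen for illustration) evaluates the impulse response of G(z) = ½[H²(z) - H²(-z)] for the end-point filter of Eq. (I-27) with N = 6, and for an arbitrary 4-tap lowpass-like prototype. Only the former yields a pure delay.

```python
def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def modulate(h):
    """Return (-1)^n h(n), whose z-transform is H(-z)."""
    return [v if n % 2 == 0 else -v for n, v in enumerate(h)]


def system_response(h):
    """Impulse response of G(z) = (1/2)[H^2(z) - H^2(-z)]."""
    hm = modulate(h)
    return [0.5 * (a - b) for a, b in zip(convolve(h, h), convolve(hm, hm))]


N = 6
h_endpoints = [2 ** -0.5] + [0.0] * (N - 2) + [2 ** -0.5]   # Eq. (I-27)
g_endpoints = system_response(h_endpoints)
print([round(v, 12) for v in g_endpoints])   # unit sample at index N - 1

h_lowpass = [0.1, 0.4, 0.4, 0.1]             # arbitrary prototype (assumption)
g_lowpass = system_response(h_lowpass)
print([round(v, 12) for v in g_lowpass])     # a_1 = 2 h_0 h_1 is nonzero
```

For the second filter the coefficient a_1 = 2 h_0 h_1 = 0.08 does not vanish, so the response cannot be a delayed impulse, in line with Eq. (I-26a).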

It is interesting to relate this result to the use of the discrete Fourier transform (DFT) to
effect a similar transformation of domains. The DFT, an invertible function, transforms N
points in the time domain to N points in the frequency domain. The DFT may be implemented
(Ref. 43) by down-sampling the outputs of a set of N linear filters. The impulse responses of the
filters are:

$$h_k(n) = \begin{cases} \exp(-j2\pi kn/N), & 0 \le n \le N-1 \\ 0, & \text{otherwise} \end{cases}
\qquad \text{for filter } k,\ 0 \le k \le N-1. \tag{I-28}$$

A set of N DFT output samples is obtained, one at the output of each filter for every N input
samples, by sampling each filter output once every N samples. Thus, the DFT can be implemented
as a filter bank of N channels with outputs that can each be resampled at 1/N the original
rate. The sum of the sampling rates of the channels is the original sampling rate. The
DFT and the inverse discrete Fourier transform (IDFT) form an identity system in the same
form as is desired for the digital encoder. From Eq. (I-28), the impulse responses of the filters
used in the 2-point DFT are:

$$h_0(n) = \begin{cases} 1, & n = 0, 1 \\ 0, & \text{otherwise} \end{cases}
\qquad
h_1(n) = \begin{cases} 1, & n = 0 \\ -1, & n = 1 \\ 0, & \text{otherwise.} \end{cases} \tag{I-29}$$

Comparing the DFT filters with the filters used in the 2-channel quadrature mirror-filter scheme,
the DFT filters are (to within a constant which is in the IDFT filters) the prototype filter and its
mirror filter needed for an identity system. Thus, the 2-point DFT system is an identity
quadrature mirror-filter system.

As was determined earlier in this Section, the use of linear phase FIR filters of odd length
results in a zero of the system response at ω = π/2. The basic quadrature mirror-filter
relationships can be modified so that odd-length filters may be acceptable. By changing the
relationships of the filters, the system function is altered. The modification requires the
insertion of a delay in one channel when filtering and a corresponding delay in the other channel
when refiltering. The aliasing components are still cancelled. The new relations are described in
Eq. (I-30) along with their z-transforms:

$$\begin{aligned}
h_1(n) &= h(n) &&\longleftrightarrow& H_1(z) &= H(z) \\
h_2(n) &= (-1)^{n-1}\,h(n-1) &&\longleftrightarrow& H_2(z) &= z^{-1}\,H(-z) \\
k_1(n) &= h(n-1) &&\longleftrightarrow& K_1(z) &= z^{-1}\,H(z) \\
k_2(n) &= (-1)^{n}\,h(n) &&\longleftrightarrow& K_2(z) &= H(-z)
\end{aligned} \tag{I-30}$$

Substituting these new relations in the equation of the output of the basic block scheme of
Fig. IV-4, Eq. (I-5):

$$S(z) = \tfrac{1}{2}\,[H_1(z)\,K_1(z) + H_2(z)\,K_2(z)]\,X(z)
      + \tfrac{1}{2}\,[H_1(-z)\,K_1(z) + H_2(-z)\,K_2(z)]\,X(-z)$$

$$\phantom{S(z)} = \tfrac{1}{2}\,[H(z)\,z^{-1}H(z) + z^{-1}H(-z)\,H(-z)]\,X(z)
      + \tfrac{1}{2}\,[H(-z)\,z^{-1}H(z) + (-z)^{-1}H(z)\,H(-z)]\,X(-z)$$

$$\phantom{S(z)} = \tfrac{1}{2}\,z^{-1}\,[H^2(z) + H^2(-z)]\,X(z) \tag{I-31}$$
As before, the aliasing terms have vanished leaving the linear filtered terms only. Proceeding
by defining an impulse response and its z-transform:

$$G(z) = \frac{S(z)}{X(z)} = \tfrac{1}{2}\,z^{-1}\,[H^2(z) + H^2(-z)] \tag{I-32}$$

This equation differs from the system function of the original, unmodified filter relations,
Eq. (I-17), only by a sign change (i.e., the summation of the terms rather than the difference)
and a delay of one sample.

Assume now that the filter, h(n), is a symmetric FIR filter of length N. The linear phase
term may be factored out of the transform. Evaluating the system function on the unit circle as
in Eq. (I-22):

$$H(\omega) = H'(\omega)\,\exp[-j\omega(N-1)/2]$$

$$G(\omega) = \tfrac{1}{2}\,\exp[-j\omega]\,\{H'^2(\omega)\,\exp[-j\omega(N-1)] + H'^2(\omega+\pi)\,\exp[-j(\omega+\pi)(N-1)]\}$$

$$\phantom{G(\omega)} = \tfrac{1}{2}\,\exp[-j\omega N]\,[H'^2(\omega) + (-1)^{N-1}\,H'^2(\omega+\pi)] \tag{I-33}$$

Since

$$H'^2(\omega) = H'^2(\omega + \pi) \quad \text{at } \omega = \frac{\pi}{2},$$

there will be a zero of the system function at that frequency unless the length of the filter, N, is
odd.

Perhaps it is now possible to find a suitable filter of odd length that will yield an identity
system:

$$H(z) = h_0 + h_1 z^{-1} + \cdots + h_{N-1}\,z^{-(N-1)}$$

where

$$h_k = h_{N-1-k} \quad \text{and} \quad h_0 = h_{N-1} \neq 0. \tag{I-34}$$

$$H^2(z) = a_0 + a_1 z^{-1} + \cdots + a_{2N-2}\,z^{-(2N-2)}$$

$$H^2(-z) = a_0 - a_1 z^{-1} + \cdots + a_{2N-2}\,z^{-(2N-2)}$$

$$G(z) = z^{-1}\,[a_0 + a_2 z^{-2} + \cdots + a_{2N-2}\,z^{-(2N-2)}]$$

where

$$a_n = h_0 h_n + h_1 h_{n-1} + \cdots + h_n h_0. \tag{I-35}$$

But

$$a_0 = h_0^2 \neq 0 \quad \text{and} \quad a_{2N-2} = h_{N-1}^2 \neq 0.$$

To make G(z) as desired, all of the terms must vanish except for n = (N - 1). For h(n) to be a
length N filter, h(0) and h(N - 1) cannot be zero. Then, two of the terms above are nonzero.
Thus, there are no filters of odd length that would result in an identity system.
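A small numerical example makes the argument concrete. The sketch below evaluates the impulse response of the modified system function of Eq. (I-32) for a 3-tap symmetric filter; the coefficients and helper names are arbitrary choices for illustration. The first and last terms, h_0² and h_{N-1}², are both nonzero, so the response cannot reduce to a single delayed impulse.

```python
def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def modulate(h):
    """Return (-1)^n h(n), whose z-transform is H(-z)."""
    return [v if n % 2 == 0 else -v for n, v in enumerate(h)]


def modified_system_response(h):
    """Impulse response of G(z) = (1/2) z^-1 [H^2(z) + H^2(-z)], Eq. (I-32)."""
    hm = modulate(h)
    summed = [0.5 * (a + b) for a, b in zip(convolve(h, h), convolve(hm, hm))]
    return [0.0] + summed          # leading zero implements the z^-1 delay


h = [0.25, 0.5, 0.25]              # arbitrary odd-length symmetric filter (assumption)
g = modified_system_response(h)
print(g)
nonzero = [n for n, v in enumerate(g) if abs(v) > 1e-12]
print("nonzero taps at n =", nonzero)
```

Here the nonzero taps fall at n = 1, 3, and 5: the middle tap is the wanted delay term, while the outer two are the h_0² and h_{N-1}² terms that cannot be removed.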
C. UNEQUAL BANDWIDTH FILTER BANK DESIGN

The three-channel quadrature mirror-filter system shown in Fig. IV-9 will result in some
aliasing. Analysis of the equations shows the problems involved with the aliasing:

$$X_1(z) = H_1(z)\,X(z) \tag{I-36a}$$

$$X_2(z) = H_2(z)\,X(z) \tag{I-36b}$$

$$Y_1(z) = \tfrac{1}{2}\,[X_1(z^{1/2}) + X_1(-z^{1/2})]
        = \tfrac{1}{2}\,[H_1(z^{1/2})\,X(z^{1/2}) + H_1(-z^{1/2})\,X(-z^{1/2})] \tag{I-37a}$$

$$Y_2(z) = \tfrac{1}{2}\,[X_2(z^{1/2}) + X_2(-z^{1/2})]
        = \tfrac{1}{2}\,[H_2(z^{1/2})\,X(z^{1/2}) + H_2(-z^{1/2})\,X(-z^{1/2})] \tag{I-37b}$$

$$U'_1(z) = Y'_1(z^2) = G(z^2)\,Y_1(z^2)
         = \tfrac{1}{2}\,G(z^2)\,[H_1(z)\,X(z) + H_1(-z)\,X(-z)] \tag{I-38a}$$

$$U_2(z) = Y_2(z^2) = \tfrac{1}{2}\,[H_2(z)\,X(z) + H_2(-z)\,X(-z)] \tag{I-38b}$$

Filtering again and summing:

$$T'_1(z) = K_1(z)\,U'_1(z)
         = \tfrac{1}{2}\,G(z^2)\,[K_1(z)\,H_1(z)\,X(z) + K_1(z)\,H_1(-z)\,X(-z)] \tag{I-39a}$$

$$T_2(z) = K_2(z)\,U_2(z)
        = \tfrac{1}{2}\,[K_2(z)\,H_2(z)\,X(z) + K_2(z)\,H_2(-z)\,X(-z)] \tag{I-39b}$$

$$S(z) = T'_1(z) + T_2(z)
      = \tfrac{1}{2}\,[H_1(z)\,K_1(z)\,G(z^2) + H_2(z)\,K_2(z)]\,X(z)
      + \tfrac{1}{2}\,[H_1(-z)\,K_1(z)\,G(z^2) + H_2(-z)\,K_2(z)]\,X(-z)$$

$$\phantom{S(z)} = \tfrac{1}{2}\,[H^2(z)\,G(z^2) - H^2(-z)]\,X(z)
      + \tfrac{1}{2}\,[G(z^2) - 1]\,[H(z)\,H(-z)]\,X(-z) \tag{I-40}$$

The second term in Eq. (I-40) represents the aliasing in the output. As stated in Chapter IV-4,
the aliasing will only appear in the frequency overlap of the filter h(n) and its mirror filter.
Since

$$G(z^2) \approx 1$$

in the overlap region, the aliasing will be very small.
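
The reduction of the general output expression to the final form of S(z) can be spot-checked numerically. The sketch below assumes the quadrature mirror assignments H1(z) = H(z), H2(z) = H(−z), K1(z) = H(z), K2(z) = −H(−z), which are consistent with the reduced expression but are not restated in this appendix. The coefficient values for h, g, and x are hypothetical.

```python
import cmath

def tf(coeffs, z):
    """Evaluate sum_k c_k z^{-k} at the complex point z."""
    return sum(c * z ** (-k) for k, c in enumerate(coeffs))

# Hypothetical example data (not from the report).
h = [0.1, 0.4, 0.4, 0.1]      # prototype filter h(n)
g = [0.05, 0.9, 0.05]         # stand-in for the inner system G(z)
x = [1.0, -0.5, 0.25, 2.0]    # short input sequence

def H(z):
    return tf(h, z)

def s_general(z, g):
    """Output in terms of H1, H2, K1, K2 under the assumed QMF assignments."""
    H1, H2 = H, (lambda v: H(-v))
    K1, K2 = H, (lambda v: -H(-v))
    G2 = tf(g, z * z)
    direct = 0.5 * (H1(z) * K1(z) * G2 + H2(z) * K2(z)) * tf(x, z)
    alias = 0.5 * (H1(-z) * K1(z) * G2 + H2(-z) * K2(z)) * tf(x, -z)
    return direct + alias

def alias_term(z, g):
    """Aliasing term of the reduced form: 1/2 [G(z^2) - 1] H(z) H(-z) X(-z)."""
    return 0.5 * (tf(g, z * z) - 1.0) * H(z) * H(-z) * tf(x, -z)

def s_reduced(z, g):
    G2 = tf(g, z * z)
    return 0.5 * (H(z) ** 2 * G2 - H(-z) ** 2) * tf(x, z) + alias_term(z, g)

z0 = cmath.exp(0.7j)                       # a test point on the unit circle
difference = abs(s_general(z0, g) - s_reduced(z0, g))
alias_when_g_is_one = abs(alias_term(z0, [1.0]))
print(difference, alias_when_g_is_one)
```

The two forms agree at the test point, and the aliasing term vanishes when G is exactly unity, which is the limiting case of the approximation used in the text.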

APPENDIX II

STATISTICAL  ANALYSIS  OF  EXPERIMENTAL  PROCEDURE 


Statistical analysis of the experiments to evaluate the performance of the digital encoding
system gives further insight into the problem of meaningful evaluation and comparison of speech
and audio processing systems. The differential threshold of encoding degradation was defined
in Chapter V as the degradation that yields a 75-percent probability of a correct response in the
2I2AFC comparisons. This JND threshold is a mean probability of a correct response averaged
over an ensemble of the entire population of prospective users. The ability of a subject to detect
differences is not necessarily the same as that of other subjects and can be modeled as a sample
of a random variable. This random variable, the probability of a correct response, is assumed
to have a Gaussian distribution. The standard deviation of the distribution is a measure of how
much the detection ability of each person varies from the average. The experimental analysis
problem is twofold:

(a)  Estimate  the  probability  of  a  correct  response  for  each  subject  via  their 
responses  on  several  trials 

(b)  Estimate  the  average  probability  of  the  ensemble  and  the  variance  of  the 
density  function  from  the  estimates  of  the  individual  subject  probability. 

For a given subject, subject m, estimate the probability of a correct response. Let x be
the random variable of the response to a trial, 0 if not correct, 1 if correct, and let

$$\hat{P}_m = \frac{1}{N} \sum_{i=1}^{N} x_i \tag{II-1}$$

be the estimate of P_m, where

x_i = 0 if not correct on the ith trial, which occurs with probability (1 − P_m)

x_i = 1 if correct on the ith trial, which occurs with probability P_m

N = number of trials

P_m = E[x] = probability of a correct response for subject m

E = expectation operator defined over the ensemble of trials for a particular subject.

$$E[\hat{P}_m] = P_m \tag{II-2}$$

$$\mathrm{Var}[\hat{P}_m] = \frac{P_m (1 - P_m)}{N} \tag{II-3}$$


This  estimator  is  an  unbiased  and  efficient  estimator.  If  the  number  of  trials,  N,  is  large,  the 
distribution  of  the  estimate  can  be  approximated  well  by  a  normal  density  function  with  mean 
and  variance  given  above.  Evaluated  for  48  trials  and  a  subject  probability  of  75  percent,  the 
standard  deviation  of  the  estimator  is  6.25  percent. 
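
The quoted figure follows directly from the variance of the estimator. The sketch below reproduces it for a hypothetical subject who answers 36 of 48 trials correctly. The function names are illustrative only.

```python
import math

def estimate_probability(responses):
    """Per-subject estimate: the sample mean of the 0/1 responses."""
    return sum(responses) / len(responses)

def estimator_std(p, n_trials):
    """Standard deviation of the estimate, sqrt(p (1 - p) / N)."""
    return math.sqrt(p * (1.0 - p) / n_trials)

responses = [1] * 36 + [0] * 12              # hypothetical subject, 48 trials
p_hat = estimate_probability(responses)       # 0.75
std = estimator_std(p_hat, len(responses))    # 0.0625, i.e. 6.25 percent
print(p_hat, std)
```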

Using these estimates of the statistics for each subject it is now possible to estimate the
density function over the population of subjects. Let P be the random variable with samples
estimated above in Eq. (II-1) and a density which is assumed to be normal. The number of
subjects is M. Assume that:

$$P = N(p, \sigma^2).$$

Let:

$$\hat{p} = \frac{1}{M} \sum_{m} \hat{P}_m \quad \text{be the estimate of the mean, and}$$

$$\hat{\sigma}^2 = \frac{1}{M-1} \sum_{m} (\hat{P}_m - \hat{p})^2 \quad \text{be the estimate of the variance.} \tag{II-6}$$

The distribution of the estimator of the mean is described by Eqs. (II-7) and (II-8). The second
term in Eq. (II-8) is due to estimating p by the estimates of the individual subject probabilities
rather than the exact probabilities:

$$E[\hat{p}] = p \tag{II-7}$$

$$\mathrm{Var}[\hat{p}] = \frac{\sigma^2}{M} + \frac{1}{M^2} \sum_{m} \frac{P_m (1 - P_m)}{N} \approx \frac{\sigma^2}{M} + \frac{\hat{p} (1 - \hat{p})}{MN} \tag{II-8}$$


The results of the experiments presented in Table V-1 can now be analyzed. Note that it is
actually the estimates of the individual sample probabilities that are used. The experiment with
five subjects using the quantization bit distribution parameters denoted as System E is:

$$\hat{p} = 0.722 \tag{II-9}$$

$$\hat{\sigma}^2 = (0.0559)^2 \tag{II-10}$$

$$\mathrm{Var}(\hat{p}) \approx (0.0250)^2 + (0.0289)^2 = (0.0382)^2 \tag{II-11}$$
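
The two contributions to the variance of the mean estimate can be reproduced from the System E figures. The sketch below assumes M = 5 subjects and N = 48 trials per subject. The trial count is carried over from the per-subject discussion and is an assumption here, although it is consistent with the quoted numbers. The function name is hypothetical.

```python
import math

def mean_estimator_variance(sigma, p, n_subjects, n_trials):
    """Return the two variance terms: sigma^2 / M and p (1 - p) / (M N)."""
    between_subjects = sigma ** 2 / n_subjects
    within_subject = p * (1.0 - p) / (n_subjects * n_trials)
    return between_subjects, within_subject

between, within = mean_estimator_variance(sigma=0.0559, p=0.722,
                                          n_subjects=5, n_trials=48)
total_std = math.sqrt(between + within)
print(math.sqrt(between), math.sqrt(within), total_std)
```

The first term reflects the spread of detection ability across subjects. The second reflects the finite number of trials used to estimate each subject's probability.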


For System E, the estimated mean is 72.2 percent. The distribution of this estimate of the mean
is approximately Gaussian with mean 72.2 percent and standard deviation of 2.5 percent. The
estimate of the variance for System E gives a standard deviation of 5.59 percent. Assuming the
normal distribution, the above estimates can be used to make the following claims for System E:
50 percent of the population have a probability of a correct response on a trial of less than
72.2 percent; 69 percent have a probability of less than 75 percent; 84 percent have a probability
of less than 78 percent; 93 percent have a probability of less than 81 percent; and 98 percent
have a probability of less than 84 percent. Since the criterion for the JND was set at
75 percent a priori, it is concluded that over the population of subjects, 31 percent will be able
to discern a difference between unprocessed speech and speech processed by System E in a test
like the one performed here. Presentation of continuous speech, however, is equivalent to
many trials. The probability of detection of the encoding degradation is larger than for a single
sentence pair and is a function of the length of the presentation of continuous speech.
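
The population percentages quoted for System E can be reproduced from the normal model. The sketch below evaluates the cumulative distribution at the mean plus 0, 0.5, 1, 1.5, and 2 standard deviations. Reading the quoted thresholds (75, 78, 81, and 84 percent) as these values rounded to whole percentages is an interpretation, not something stated in the text.

```python
from statistics import NormalDist

mu, sigma = 0.722, 0.0559                 # System E estimates
population = NormalDist(mu=mu, sigma=sigma)

steps = [0.0, 0.5, 1.0, 1.5, 2.0]         # multiples of the standard deviation
thresholds = [mu + k * sigma for k in steps]
percents = [round(100 * population.cdf(t)) for t in thresholds]

# Fraction of the population whose probability of a correct response
# exceeds the 75-percent JND criterion.
fraction_above_jnd = 1.0 - population.cdf(0.75)

for t, pct in zip(thresholds, percents):
    print(f"{pct} percent of subjects fall below {100 * t:.1f} percent correct")
print(f"{100 * fraction_above_jnd:.0f} percent exceed the JND criterion")
```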

LIST OF ABBREVIATIONS


A/D       Analog to digital
ADM       Adaptive delta modulation
ADPCM     Adaptive differential pulse code modulation
APC       Adaptive predictive coding
APCM      Adaptive pulse code modulation
ATC       Adaptive transform coding
CCD       Charge-coupled device
D/A       Digital to analog
dB        Decibels
DFT       Discrete Fourier transform
DM        Delta modulation
FFT       Fast Fourier transform
FIR       Finite impulse response
IDFT      Inverse discrete Fourier transform
IIR       Infinite impulse response
JND       Just noticeable difference (differential threshold)
kbps      Kilobits per second
kHz       Kilohertz
PCM       Pulse code modulation
RMS       Root mean square
SNR       Signal-to-noise ratio
SPL       Sound pressure level, re 0.0002 dyne/sq cm
2I2AFC    Two-interval, two-alternative forced choice